Engine ML requires a file called engine.yaml at the root of your project. This file tells Engine ML how to build and run your code.
# The engine yaml api version
apiVersion: "2.0.0"

# Which version of Python to use. Versions 2.7 and 3.6 are supported
pythonVersion: "2.7"

# Which framework and version of that framework to use.
# TensorFlow versions 1.3.0, 1.4.1, 1.5.1, 1.6.0, 1.7.1, 1.8.0, 1.9.0, 1.10.1, 1.11.0, 1.12.3, 1.13.2, 1.14.0
# are supported.
# PyTorch versions 0.4.0, 0.4.1, 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0 are supported
tensorFlowVersion: "1.8.0"
# pytorchVersion: "1.0.0"

# Compresses gradients to 16-bit during gradient reduction.
# This lowers the data sent over the network by 1/2, but may cause training
# instability.
gradientCompression: false

# How many GPUs to run on (It is most efficient to use a multiple of 8)
numGPUs: 1

# The engine build script. This script installs all dependencies
# required for your project.
# The path is relative to the root of your repository.
build: engine_build.sh

# Valid values are "k80" and "v100"
gpuType: "k80"

# What instance lifecycle type to use. Valid values are:
# "never": runs on only dedicated instances
# "always": runs on only preemptible instances
# "prefer": allocates as many preemptible instances as possible, and dedicated instances when
#           those are unavailable. When instances are preempted, they are replaced by
#           preemptible instances when available (and dedicated instances as needed) and your job
#           will be restarted.
preemptible: "prefer"

# S3 bucket where the training data is stored.
# If your data is hosted with Engine ML, Please contact us for the s3 bucket name
dataBucket: my-data.my-company.com

# Subdirectory of the S3 bucket containing data
dataBucketSubdirectory: /

# Input directory for restoring from checkpoints
# From your code, you can get the directory where the data is mounted
# with `eml.data.input_dir()`
# Either set jobId OR bucket and bucketPath, but not both.
inputDir:
  jobId: adaptive-strut
  # bucket: my-bucket.my-company.com
  # bucketPath: /checkpoints/

# The command to run.
# The path is relative to the root of your repository.
commands:
  - python path/to/train.py --option foo
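Several constraints in the comments above (the supported Python versions, the valid gpuType and preemptible values, and the jobId versus bucket/bucketPath rule for inputDir) can be checked locally before a job is submitted. The following is a minimal sketch, assuming the file has already been parsed into a Python dict. The helper name check_engine_config is hypothetical and is not part of the Engine ML client, and the checks cover only what the comments state.

```python
# Hypothetical pre-flight check for a parsed engine.yaml mapping.
# This helper is an illustration only; it is not part of Engine ML.

# Allowed values, taken from the comments in the sample engine.yaml.
ALLOWED_VALUES = {
    "pythonVersion": {"2.7", "3.6"},
    "gpuType": {"k80", "v100"},
    "preemptible": {"never", "always", "prefer"},
}


def check_engine_config(config):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []

    # Only validate keys that are present; whether a key is required
    # is not stated in the sample file, so that is not checked here.
    for key, allowed in ALLOWED_VALUES.items():
        if key in config and config[key] not in allowed:
            problems.append(
                "%s must be one of: %s" % (key, ", ".join(sorted(allowed)))
            )

    # inputDir: either jobId, or bucket together with bucketPath, not both.
    input_dir = config.get("inputDir")
    if input_dir is not None:
        has_job = "jobId" in input_dir
        has_any_bucket = "bucket" in input_dir or "bucketPath" in input_dir
        has_full_bucket = "bucket" in input_dir and "bucketPath" in input_dir
        if has_job and has_any_bucket:
            problems.append("inputDir: set jobId or bucket/bucketPath, not both")
        elif not has_job and not has_full_bucket:
            problems.append("inputDir: set jobId, or both bucket and bucketPath")

    return problems


# Values copied from the sample file above.
config = {
    "apiVersion": "2.0.0",
    "pythonVersion": "2.7",
    "gpuType": "k80",
    "preemptible": "prefer",
    "inputDir": {"jobId": "adaptive-strut"},
}
print(check_engine_config(config))  # prints []
```

Returning a list of problems rather than raising on the first one lets every issue be reported in a single pass.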
Multiple Projects in the same Repository
If you have multiple experiments in the same repository, you can use symlinks to select which experiment to run. See our examples repository for an example of this.
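As a rough illustration of the symlink approach, the sketch below points engine.yaml at the root of the repository to a per-experiment config file. It assumes that the symlinked file is engine.yaml itself and that each experiment keeps its own config in a subdirectory. Both the layout and the helper name select_experiment are hypothetical, so check the examples repository for the arrangement Engine ML actually expects.

```python
import os


def select_experiment(repo_root, experiment_config):
    """Point <repo_root>/engine.yaml at the chosen experiment's config file.

    Hypothetical helper: the directory layout is an assumption, not an
    Engine ML convention.
    """
    link_path = os.path.join(repo_root, "engine.yaml")

    if os.path.islink(link_path):
        # Replace a link left over from a previously selected experiment.
        os.remove(link_path)
    elif os.path.exists(link_path):
        # Refuse to overwrite a regular file that may hold real settings.
        raise FileExistsError("%s exists and is not a symlink" % link_path)

    # A relative target keeps the link valid when the repository is
    # cloned or checked out at a different location.
    target = os.path.relpath(experiment_config, repo_root)
    os.symlink(target, link_path)
    return link_path


# Example call (the path is hypothetical):
# select_experiment(".", "experiments/resnet/engine.yaml")
```

Creating symlinks may need extra privileges on Windows, and Git only preserves them when its symlink support is enabled.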