Launch your Experiment

Experiments are launched when you push your code to the engine git remote.
First, add and commit your build script and `engine.yaml` file, then push to the `engine` remote:

```shell
git add engine.yaml engine_build.sh
git commit -m "Added Engine ML files"
git push engine branch-name
```

This process generally takes around 5 minutes. When it completes, your job will be running.
If you are just getting started and want to experiment on only 2 GPUs, you can 'lease' machines with `engine lease create`. Jobs that run on leased machines often launch in less than 2 minutes.
Monitor Job Launch Progress
When you push your code to the `engine` git remote, you should see status updates about what is happening to your job printed to your terminal. If engine encounters an error while launching your job, it will print out an explanation of what happened and how to fix it:
```
$ engine job status
2018-09-19T23:29:38: Cloning
2018-09-19T23:29:38: ParseEngineFile
2018-09-19T23:29:38: Failed
Job creation failed in state: ParseEngineFile
Error message: Failure to parse engine.yaml file: Error in $.numGPUs: expected Int, encountered String
Suggested fix: Correct any errors in your engine.yaml file.
```
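The error above means `numGPUs` was given as a string where an integer was expected. A minimal sketch of the offending field, assuming only what the error path `$.numGPUs` implies (a top-level key in `engine.yaml`); the rest of the Engine ML schema is not shown here:

```yaml
# engine.yaml (fragment) — only numGPUs is taken from the error message above
numGPUs: 2      # correct: an unquoted integer
# numGPUs: "2"  # wrong: quoting makes it a string and triggers the parse error
```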
If for some reason you are disconnected from the git server during your push, your job will still be processed. You can view the state of your job by running `engine job status`.
Correlate Metrics with Code
The metrics dashboard is useful for finding bottlenecks. Engine ML makes it easy to correlate metrics with code using annotations.

Note: metric annotations are not visible by default. Make sure to check the box at the top of the page to turn them on.

Annotations are useful for marking things like:
Slow operations like image preprocessing:

```python
eml.annotate(title='Data', comment='Start Preprocessing', tags=('data',))
data = preprocess_data(data)
eml.annotate(title='Data', comment='Finished Preprocessing', tags=('data',))
```
The beginning of each epoch:

```python
for epoch in range(max_epochs):
    eml.annotate(title='Epoch', comment=epoch)
    for data, label in data_iterator():
        train(data, label)
```
The start of training and evaluation:

```python
eml.annotate(title='Starting training')
for data, label in train_data.iterator():
    train(data, label)

eml.annotate(title='Starting evaluation')
for data, label in eval_data.iterator():
    evaluate(data, label)
```
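The paired start/finish calls above can be wrapped in a small helper so they always stay matched, even when the wrapped code raises. This is a hypothetical sketch, not part of the eml API: it assumes only the `eml.annotate(title=..., comment=..., tags=...)` signature shown above, and a recording stub stands in for `eml` so the sketch runs anywhere.

```python
from contextlib import contextmanager

# Recording stub standing in for the real eml package; in a real job you
# would `import eml` instead of defining this class.
calls = []

class eml:
    @staticmethod
    def annotate(**kwargs):
        calls.append(kwargs)

@contextmanager
def annotated(title, tags=()):
    """Emit paired Start/Finished annotations around a block of work."""
    eml.annotate(title=title, comment='Start', tags=tags)
    try:
        yield
    finally:
        # runs even if the block raises, so the pair is never left open
        eml.annotate(title=title, comment='Finished', tags=tags)

with annotated('Data', tags=('data',)):
    data = list(range(4))  # stand-in for preprocess_data(data)

print([c['comment'] for c in calls])  # → ['Start', 'Finished']
```

Because the closing annotation sits in a `finally` block, a crash mid-preprocessing still marks the end of the span on the dashboard.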
Open Tensorboard in your browser
```shell
engine job tensorboard JOB_ID
```
`engine job tensorboard` accepts multiple Job IDs, making it easy to compare models with your peers.

Keeping track of Job IDs can be tricky. You can organize your jobs with tags, then pass them to `engine job tensorboard` with the `--tag` (`-t`) flag. Multiple `-t` flags are accepted:
```shell
engine job tag lateral-stick resnet50
engine job tag colorful-fastener resnet50
engine job tag reliable-switch resnet101
engine job tensorboard -t resnet50 -t resnet101
```
To delete tags, pass the `-d` flag to `engine job tag`:
```shell
engine job tag -d lateral-stick resnet50
```
Stop your Experiment
If your Python code completes or exits with an error, Engine ML will automatically shut down the job and all of the associated GPU instances for you.
If you want to stop a job early, you can run `engine job stop JOB_ID`:
```
$ engine job stop lateral-stick
Stopping lateral-stick...
Job stopped
```
Download Checkpoints and Events
When your experiment is complete, you can download your saved model and event files with `engine job output`. See the documentation if you want to download individual files.
```
$ engine job output lateral-stick get '*'
Downloading outputs for job lateral-stick...
Downloading outputs matching "*" for job lateral-stick
Saved to file "lateral-stick-1544679573.tar.gz"
```