eml

annotate

The metrics dashboard is useful for finding bottlenecks. Engine ML makes it easy to correlate metrics with code using eml.annotate().

Annotations

Tip

Metrics annotations are not visible by default. Make sure to check the box at the top of the metrics page to turn them on.

Usage

Slow operations like image preprocessing
eml.annotate(title='Data', comment='Start Preprocessing', tags=['data'])
data = preprocess_data(data)
eml.annotate(title='Data', comment='Finished Preprocessing', tags=['data'])
The beginning of each epoch
for epoch in range(max_epochs):
    eml.annotate(title='Epoch', comment=epoch)
    for data, label in data_iterator():
        train(data, label)
The start of training and evaluation
eml.annotate(title='Starting training')
for data, label in train_data.iterator():
    train(data, label)

eml.annotate(title='Starting evaluation')
for data, label in eval_data.iterator():
    evaluate(data, label)
To ensure all replicas reach the same milestones

By default, eml.annotate() only annotates for the master replica. To view annotations for all replicas, set all_replicas=True.

for epoch in range(max_epochs):
    for train_data, train_label in train_data_iterator():
        train(train_data, train_label)
    eml.annotate(title='Finished Training for Epoch %s' % epoch, all_replicas=True)

    validate(validation_data, validation_label)
    eml.annotate(title='Finished Validation for Epoch %s' % epoch, all_replicas=True)

is_engine_runtime

If you need separate logic for code running on Engine ML, use eml.is_engine_runtime(). It returns True when your code is running on Engine ML and False otherwise.

if eml.is_engine_runtime():
    # Do something only if running on Engine ML
    pass
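
For example, you might write outputs to the Engine ML output directory when running on the platform and to a local path otherwise. A minimal sketch, assuming the PyTorch integration and a hypothetical local fallback directory:

import os

import engineml.torch as eml

# Pick an output directory based on where the code is running
if eml.is_engine_runtime():
    output_dir = eml.data.output_dir()
else:
    output_dir = './outputs'  # hypothetical local fallback path

checkpoint_path = os.path.join(output_dir, 'checkpoint.pt')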

num_replicas

eml.num_replicas() returns an integer representing how many GPUs (replicas) your run is using.
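
A common use is scaling hyperparameters with the size of the run. A minimal sketch, assuming the PyTorch integration and an arbitrary base learning rate:

import engineml.torch as eml

# Hypothetical example: scale the learning rate linearly with the number of replicas
base_lr = 0.01  # assumed single-GPU learning rate
lr = base_lr * eml.num_replicas()
print('Training on %d replicas with learning rate %g' % (eml.num_replicas(), lr))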

preempted_handler

Training with spot or preemptible instances is significantly cheaper, but there is a small risk that your run will be preempted. With PyTorch or TensorFlow, use eml.preempted_handler(fn, *args, **kwargs) to automatically save a checkpoint, or perform any other cleanup, before your run shuts down when preemption occurs. If you are using the prefer option for preemptible instances, the preempted handler lets you save your progress and resume from where you left off when your run is restarted.

PyTorch

import os

import engineml.torch as eml

# Create a handler to automatically write a checkpoint when a run is preempted
def save_checkpoint(model, optimizer, checkpoint_path):
    state = {
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
    }
    eml.save(state, checkpoint_path)

# Set the preempted checkpoint handler
eml.preempted_handler(save_checkpoint, net, opt,
                      os.path.join(eml.data.output_dir(), 'preempted.pt'))
TensorFlow

import os

import engineml.tensorflow as eml
import tensorflow as tf

saver = tf.train.Saver()
saver = eml.saver(saver)

with tf.Session(config=eml.session.make_distributed_config()) as sess:
    # Set a handler to automatically save a model checkpoint if the run is preempted
    eml.preempted_handler(saver.save, sess, os.path.join(eml.data.output_dir(), 'preempted'))

replica_id

eml.replica_id() returns an integer between 0 and eml.num_replicas() - 1, representing which replica (i.e. GPU) the code is running on.
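
A minimal sketch of how the replica id might be used, assuming the PyTorch integration; only the master replica (id 0) prints, and a hypothetical per-replica file name is tagged with the id:

import engineml.torch as eml

# Only print summary output from replica 0
if eml.replica_id() == 0:
    print('Running on %d replicas' % eml.num_replicas())

# Hypothetical per-replica file name, tagged with the replica id
log_name = 'metrics-replica-%d.log' % eml.replica_id()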

save

When using PyTorch, it is recommended to save your models with eml.save. eml.save accepts the same arguments as torch.save, but it guarantees that only one checkpoint is saved per eml.save(...) call, regardless of how many GPUs or model replicas there are.

import engineml.torch as eml

def save_checkpoint(model, optimizer, checkpoint_path):
    state = {
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
    }
    eml.save(state, checkpoint_path)
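
For instance, you might call this helper at the end of each epoch. A minimal sketch, assuming a model net, an optimizer opt, and an epoch counter from your training loop:

import os

# Write one checkpoint per epoch to the run's output directory
save_checkpoint(net, opt, os.path.join(eml.data.output_dir(), 'epoch-%d.pt' % epoch))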

saver

When using TensorFlow, it is recommended that you wrap any instance of tf.train.Saver with eml.saver. This guarantees that only one checkpoint is saved per save(...) call on the wrapped saver, regardless of how many GPUs or model replicas there are.

import engineml.tensorflow as eml
import tensorflow as tf
saver = tf.train.Saver()
saver = eml.saver(saver)
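
You can then save checkpoints as usual from inside a session. A minimal sketch, assuming a distributed session configured as in the preemption example above:

import os

with tf.Session(config=eml.session.make_distributed_config()) as sess:
    # Saves a single checkpoint to the run's output directory,
    # regardless of how many replicas are running
    saver.save(sess, os.path.join(eml.data.output_dir(), 'model'))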