| Term | Definition |
| --- | --- |
| Allreduce | A method for reducing values across replicas and distributing the result to all of them. The reduce operation is typically a sum or an average. |
| NCCL | A library of standard collective GPU-to-GPU communication routines, such as allreduce. |
| MPI | A library containing standardized means of exchanging messages between multiple nodes, with collective routines such as allreduce. |
| Node | A machine (with some number of GPUs) that runs your experiment. |
| Rank | The process number across all nodes. An integer in [0, number of replicas). |
| Replica | A copy of the model and code. There is one replica per rank, and each replica runs as a separate process. Processes communicate over MPI or NCCL. |
Sane Default Values
All calls to the Engine ML Python library return sane defaults when running locally. This prevents extra code branching that can cause hard-to-find bugs.
If you do need separate logic for running on Engine ML, you can always use
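The sane-default pattern itself can be sketched in plain Python. Everything below is illustrative, not the real Engine ML API: the function name and the `EML_RANK` environment variable are assumptions made for the example.

```python
import os

def replica_rank(default=0):
    # Illustrative only: return the rank reported by the cluster
    # environment when running distributed, and a sane default (0)
    # when running locally -- so the same code runs in both modes
    # without an if-running-on-cluster branch.
    value = os.environ.get("EML_RANK")  # hypothetical variable name
    return int(value) if value is not None else default
```

Locally no cluster environment variable is set, so the call simply returns the default.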
There are only a handful of steps required to scale your model from single-node to large-scale distributed training. The code for building a deep model is already very complex, and we don't want to add to that complexity.
Below, we walk through the modifications required to train your model on multiple machines.
Data Set Partitioning
Each replica has read-only access to data mounted at
This directory contains the data from the dataBucketSubdirectory field specified in
In order to prevent replicas from training on the exact same example, it is best to partition the data set into chunks.
eml.data.distribute accepts a list of items to split into chunks and returns a stable subset of that list for each replica.
It is best to shuffle before this operation during training so that each replica receives a uniform distribution of training examples.
eml.data.distribute can give uneven chunk sizes across replicas. If you require even chunk sizes, use eml.data.even_distribute. This is useful when each replica must have the same number of batches per epoch.
eml.data.even_distribute works by sampling random elements from the chunk until all chunks have the same size. Therefore, it is not recommended for use on your validation set.
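The library's internals aren't shown here, but a stable partition can be sketched as a seeded shuffle followed by a deterministic round-robin slice. This is a hypothetical re-implementation for illustration, not eml.data.distribute's actual code:

```python
import random

def distribute(items, rank, num_replicas, seed=0):
    # Shuffle with a fixed seed so every replica computes the same
    # order, then take a stable round-robin slice for this rank.
    # Chunk sizes may differ by one when len(items) is not divisible
    # by num_replicas -- the unevenness mentioned above.
    items = list(items)
    random.Random(seed).shuffle(items)
    return items[rank::num_replicas]

# Three replicas splitting ten examples: sizes 4, 3, and 3, with
# every example assigned to exactly one replica.
chunks = [distribute(range(10), r, num_replicas=3) for r in range(3)]
```

Because the shuffle seed and slice are deterministic, repeated calls on any replica return the same chunk.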
Engine ML caches data per-machine to make reads faster. Reshuffling data can invalidate this cache, so we recommend
Metadata operations on large directories (e.g. glob.glob) can be very slow!
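One cache-friendly pattern, offered here as a sketch rather than Engine ML's prescribed approach: keep the file-to-replica assignment fixed across epochs (so the per-machine cache stays valid, and the expensive directory listing happens only once), and reshuffle only the visit order within each replica's chunk:

```python
import random

def epoch_order(chunk, epoch):
    # The set of files in `chunk` never changes (cache-friendly, and
    # no per-epoch glob); only the visit order varies with the epoch.
    order = list(chunk)
    random.Random(epoch).shuffle(order)
    return order

# Hypothetical file names for illustration:
chunk = ["a.rec", "b.rec", "c.rec", "d.rec"]
```

Each epoch sees the same files in a different, deterministic order.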
Scale the Learning Rate
When training your model across multiple replicas, you are scaling the effective batch size. In order to compensate for this larger batch size, it is a best practice to also scale your optimizer's learning rate.
Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
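The rule above amounts to one multiplication; the replica count stands in for k here:

```python
def scaled_lr(base_lr, num_replicas):
    # Linear scaling rule: the effective batch size grows by a factor
    # of num_replicas, so grow the learning rate by the same factor.
    return base_lr * num_replicas

# A base rate of 0.1 tuned on one replica becomes 0.8 on eight.
lr = scaled_lr(0.1, num_replicas=8)
```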
Synchronize Model Replicas
Since Engine ML runs each replica as a separate process, we need to synchronize the randomly initialized weights before training begins.
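Engine ML's own synchronization helper isn't shown here; with PyTorch, the same effect can be sketched using torch.distributed, assuming the process group has already been initialized:

```python
import torch
import torch.distributed as dist

def broadcast_parameters(model, src=0):
    # Copy rank `src`'s randomly initialized weights to every other
    # replica, so all processes start from identical parameters.
    for param in model.parameters():
        dist.broadcast(param.data, src=src)
    # Buffers (e.g. batch-norm running stats) should match too.
    for buf in model.buffers():
        dist.broadcast(buf, src=src)
```

After this call, an allreduce of gradients each step keeps the replicas in lockstep.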
Enable CUDA IPC
By default, TensorFlow greedily uses all GPU memory across all GPUs on a node.
A common workaround for this is setting the environment variable
Many PyTorch examples also recommend this method of selecting a GPU.
This has the unintended consequence of disabling CUDA IPC.
CUDA IPC is required for fast distributed training.
Engine ML provides two functions that enable IPC while disabling greedy GPU memory management:
- For TensorFlow use
- For PyTorch use
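For PyTorch, the IPC-preserving alternative to restricting GPU visibility per process can be sketched as follows. The `LOCAL_RANK` variable is an assumption for illustration (it is assumed to hold this process's GPU index on its node), not a documented Engine ML variable:

```python
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # assumed variable

if torch.cuda.is_available():
    # All GPUs remain visible to the process (so CUDA IPC still works);
    # we simply pin this process's default device to its own GPU rather
    # than hiding the other GPUs from it.
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
else:
    device = torch.device("cpu")  # CPU fallback for local runs

model = torch.nn.Linear(8, 4).to(device)
```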