eml.data

This package provides utility methods for reading data.

data_dir

eml.data.data_dir() returns '/engine/data', the directory where the contents specified under dataBucket: and dataBucketSubdirectory: in engine.yaml are mounted.

When running locally, eml.data.data_dir() returns None.
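
Since eml.data.data_dir() returns None outside the cluster, a common pattern is to fall back to a local directory when running locally. A minimal sketch, assuming the package is imported as eml; the fallback path and KITTI subdirectory are illustrative:

import os
import eml

# On the cluster this is '/engine/data'; locally it is None, so fall back
# to a directory of your choosing (the path below is illustrative).
data_root = eml.data.data_dir() or os.path.expanduser('~/datasets')
image_dir = os.path.join(data_root, 'kitti/object/training/image2')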

distribute

Each replica has read-only access to data mounted at eml.data.data_dir() (i.e. '/engine/data'). This directory contains the data from the dataBucket and dataBucketSubdirectory fields specified in engine.yaml. To prevent replicas from training on the exact same examples, it is best to partition the dataset into chunks.

eml.data.distribute takes a list of items to split into chunks and returns a stable subset of that list for each replica. During training, it is best to shuffle before this operation so that each replica sees a uniform distribution of training examples (see the sketch following the example below).

Data Partition

data = [('image0.png', 'label0.png'), ('image1.png', 'label1.png'),
        ('image2.png', 'label2.png'), ('image3.png', 'label3.png')]
data_chunk = eml.data.distribute(data)
# Assuming you are running on 2 GPUs:
# On GPU 0, data_chunk = [('image0.png', 'label0.png'), ('image1.png', 'label1.png')]
# On GPU 1, data_chunk = [('image2.png', 'label2.png'), ('image3.png', 'label3.png')]
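
As noted above, shuffling before distribution gives each replica a representative chunk. Since every replica must pass the same ordering to eml.data.distribute for the chunks to stay disjoint, one way is a seeded shuffle that is identical on all replicas; a minimal sketch (the seed and file names are illustrative):

import random

data = [('image%d.png' % i, 'label%d.png' % i) for i in range(1000)]
# Shuffle identically on every replica so the chunks remain disjoint.
random.Random(1234).shuffle(data)
data_chunk = eml.data.distribute(data)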

Since data transfer from cloud storage systems such as AWS S3 is a major bottleneck when training, Engine ML uses a caching system: after the first access, reading a remote file can be as fast as reading it from a solid-state drive. Generally, the first access of a training sample occurs during the first epoch.

By setting prefetch=True in eml.data.distribute, you can start loading data into each replica's cache immediately. eml.data.distribute accepts an optional prefetch_func, which takes each item in data and returns a string, a list of strings, or a list of tuples representing absolute paths. In the best case, setting prefetch=True can result in up to a 5x speedup of an epoch.

tip

For best performance, files should be in the order that they will be fed into the model. In other words, always shuffle training samples before calling eml.data.distribute(..., prefetch=True).

import os

data = ['000000.png', '000001.png', '000002.png', '000003.png']
# prefetch_func returns a string
prefetch_func = lambda item: os.path.join('/engine/data/kitti/object/training/image2', item)
data_chunk = eml.data.distribute(data, prefetch=True, prefetch_func=prefetch_func)

or

data = ['000000', '000001', '000002', '000003']
# prefetch_func returns a list of strings
prefetch_func = lambda item: ['/engine/data/kitti/object/training/image2/%s.png' % item,
                              '/engine/data/kitti/object/training/label2/%s.txt' % item]
data_chunk = eml.data.distribute(data, prefetch=True, prefetch_func=prefetch_func)

even_distribute

eml.data.distribute can give uneven chunk sizes across replicas. If you require even chunk sizes, use eml.data.even_distribute. This is useful when each replica must have the same number of batches per epoch. eml.data.even_distribute works by sampling random elements from the chunk until all chunks have the same size, so it is not recommended for your validation set.

data = [('image0.png', 'label0.png'), ('image1.png', 'label1.png'),
        ('image2.png', 'label2.png'), ('image3.png', 'label3.png'),
        ('image4.png', 'label4.png'), ('image5.png', 'label5.png'),
        ('image6.png', 'label6.png')]
data_chunk = eml.data.even_distribute(data)
# Assuming you are running on 2 GPUs:
# On GPU 0, data_chunk = [('image0.png', 'label0.png'), ('image1.png', 'label1.png'),
#                         ('image2.png', 'label2.png'), ('image3.png', 'label3.png')]
# On GPU 1, data_chunk = [('image4.png', 'label4.png'), ('image5.png', 'label5.png'),
#                         ('image6.png', 'label6.png'), ('image5.png', 'label5.png')]
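
Because the short chunk re-samples elements ('image5.png' appears twice on GPU 1 above), metrics computed over an even_distribute chunk can double-count examples. A sketch of the split this suggests, with hypothetical train_data and val_data lists:

# Training: even chunks so every replica runs the same number of batches.
train_chunk = eml.data.even_distribute(train_data)
# Validation: plain distribute avoids duplicated samples in the metrics.
val_chunk = eml.data.distribute(val_data)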

Since data transfer from cloud storage systems such as AWS S3 is a major bottleneck when training, Engine ML uses a caching system: after the first access, reading a remote file can be as fast as reading it from a solid-state drive. Generally, the first access of a training sample occurs during the first epoch.

By setting prefetch=True in eml.data.even_distribute, you can start loading data into each replica's cache immediately. eml.data.even_distribute accepts an optional prefetch_func, which takes each item in data and returns a string, a list of strings, or a list of tuples representing absolute paths. In the best case, setting prefetch=True can result in up to a 5x speedup of an epoch.

tip

For best performance, files should be in the order that they will be fed into the model. In other words, always shuffle training samples before calling eml.data.even_distribute(..., prefetch=True).

import os

data = ['000000.png', '000001.png', '000002.png', '000003.png']
# prefetch_func returns a string
prefetch_func = lambda item: os.path.join('/engine/data/kitti/object/training/image2', item)
data_chunk = eml.data.even_distribute(data, prefetch=True, prefetch_func=prefetch_func)

or

data = ['000000', '000001', '000002', '000003']
# prefetch_func returns a list of strings
prefetch_func = lambda item: ['/engine/data/kitti/object/training/image2/%s.png' % item,
                              '/engine/data/kitti/object/training/label2/%s.txt' % item]
data_chunk = eml.data.even_distribute(data, prefetch=True, prefetch_func=prefetch_func)

input_dir

eml.data.input_dir() returns '/engine/inputs', the directory where the contents specified under inputDir: in engine.yaml are mounted.

When running locally, eml.data.input_dir() returns None.

output_dir

eml.data.output_dir(all_replicas=False) returns the directory where this replica should write checkpoints, TensorBoard event files, and any other outputs. By default, all_replicas=False, so each replica gets a different output directory to prevent replicas from overwriting each other's files. If all replicas will write uniquely named files to the output directory, set all_replicas=True.

When running locally, eml.data.output_dir() returns None.
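
A minimal sketch of the usual pattern, with an illustrative local fallback since eml.data.output_dir() returns None when running locally; the output file names are placeholders, not part of eml:

import os
import eml

out_dir = eml.data.output_dir() or './outputs'
os.makedirs(out_dir, exist_ok=True)
# Each replica gets its own directory by default, so identically named
# files on different replicas do not overwrite each other.
checkpoint_path = os.path.join(out_dir, 'model.ckpt')
event_dir = out_dir  # point your TensorBoard writer here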