Deep Learning with Big Data

Typically, training deep neural networks requires large amounts of data that often do not fit in memory. You do not need multiple computers to solve problems using data sets too large to fit in memory. Instead, you can divide your training data into mini-batches that contain a portion of the data set. By iterating over the mini-batches, networks can learn from large data sets without needing to load all data into memory at once.

If your data is too large to fit in memory, use a datastore to work with mini-batches of data for training and inference. MATLAB® provides many different types of datastore tailored for different applications. For more information about datastores for different applications, see Datastores for Deep Learning.
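
As a minimal sketch of this workflow, the following code creates an imageDatastore and reads it one batch at a time. The folder name myImages is a placeholder, and the batch size of 64 is an arbitrary choice for illustration.

```matlab
% Placeholder location: a folder of images with one subfolder per class.
dataFolder = fullfile(tempdir,"myImages");

% The datastore records file locations only. It does not load the images.
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

% Return up to 64 images from each call to read.
imds.ReadSize = 64;

reset(imds);
while hasdata(imds)
    batch = read(imds);   % one batch of images
    % Process the batch here. Only this batch is held in memory.
end
```

Training functions such as trainNetwork perform this iteration for you when you pass them a datastore, so an explicit loop like this is mainly useful for inspection or custom processing.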

augmentedImageDatastore is specifically designed to preprocess and augment batches of image data for machine learning and computer vision applications. For an example showing how to use augmentedImageDatastore to manage image data during training, see Train Network with Augmented Images.
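
The following sketch shows one way to set up an augmentedImageDatastore. The data folder, the 224-by-224 output size, and the particular augmentations are assumptions chosen for illustration.

```matlab
% Placeholder location: a folder of images with one subfolder per class.
dataFolder = fullfile(tempdir,"myImages");
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

% Random augmentations applied to images as they are read.
augmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...
    'RandRotation',[-10 10]);

% Resize every image to 224-by-224 and apply the augmentations
% to each mini-batch on the fly.
augimds = augmentedImageDatastore([224 224],imds, ...
    'DataAugmentation',augmenter);

% Pass augimds to trainNetwork in place of the original image data.
```

Because the augmentations are applied as each mini-batch is read, the augmented images are never all stored in memory at once.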

Work with Big Data in Parallel

If you want to use large amounts of data to train a network, it can be helpful to train in parallel. Doing so can reduce the time it takes to train a network, because you can train using multiple mini-batches at the same time.

Train using a GPU or multiple GPUs where possible. Use a single CPU or multiple CPUs only if you do not have a GPU. CPUs are normally much slower than GPUs for both training and inference. Running on a single GPU typically offers much better performance than running on multiple CPU cores.
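
One possible way to apply this recommendation in code is to select the execution environment from the hardware that MATLAB detects. This is a sketch only. The solver and the mini-batch size of 128 are arbitrary choices.

```matlab
% canUseGPU reports whether a supported GPU is available for computation.
% gpuDeviceCount (Parallel Computing Toolbox) reports how many GPUs are detected.
if canUseGPU && gpuDeviceCount > 1
    executionEnvironment = 'multi-gpu';
elseif canUseGPU
    executionEnvironment = 'gpu';
else
    executionEnvironment = 'cpu';
end

options = trainingOptions('sgdm', ...
    'ExecutionEnvironment',executionEnvironment, ...
    'MiniBatchSize',128);   % arbitrary mini-batch size for illustration
```

If you do not set 'ExecutionEnvironment', the default value 'auto' uses a GPU if one is available and the CPU otherwise, so explicit selection is mainly useful when you want multi-GPU or parallel training.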

For more information about training in parallel, see Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud.

Preprocess Data in Background

When you train in parallel, you can fetch and preprocess your data in the background. This can be particularly useful if you want to preprocess your mini-batches during training, such as when using the transform function to apply a mini-batch preprocessing function to your datastore.
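
As an illustration of the transform function mentioned above, the following sketch attaches a simple preprocessing step to a datastore. The data folder is a placeholder, and the rescaling function is a hypothetical example that assumes uint8 images.

```matlab
% Placeholder location of the image files.
dataFolder = fullfile(tempdir,"myImages");
imds = imageDatastore(dataFolder);

% Hypothetical preprocessing: convert uint8 pixel values to single
% precision in the range [0,1].
tds = transform(imds,@(img) single(img)/255);

% The preprocessing function runs each time data is read from the
% transformed datastore, not when transform is called.
img = read(tds);
```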

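When you train a network using the trainNetwork function, you can fetch and preprocess data in the background by enabling background dispatch:

  • Set the DispatchInBackground property of the datastore to true.

  • Set the DispatchInBackground training option to true using the trainingOptions function.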

During training, some workers are used for preprocessing data instead of network training computations. You can fine-tune the training computation and data dispatch loads between workers by specifying the 'WorkerLoad' name-value argument using the trainingOptions function. For advanced options, you can try modifying the number of workers of the parallel pool.
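
Background dispatch with trainNetwork can be sketched as follows. The data folder is a placeholder, and the 'WorkerLoad' vector assumes a parallel pool with exactly four workers. Adjust it to match the size of your own pool.

```matlab
% Placeholder location: a folder of images with one subfolder per class.
dataFolder = fullfile(tempdir,"myImages");
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
augimds = augmentedImageDatastore([224 224],imds);

% Allow the datastore to fetch and preprocess mini-batches in the background.
augimds.DispatchInBackground = true;

% Assumed pool of four workers: the first three share the training
% computations equally, and the fourth (load 0) only dispatches data.
options = trainingOptions('sgdm', ...
    'ExecutionEnvironment','parallel', ...
    'DispatchInBackground',true, ...
    'WorkerLoad',[1 1 1 0]);

% net = trainNetwork(augimds,layers,options);   % layers is not defined in this sketch
```

The 'WorkerLoad' vector must contain one value for each worker in the pool.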

You can use a built-in mini-batch datastore, such as augmentedImageDatastore, denoisingImageDatastore (Image Processing Toolbox), or pixelLabelImageDatastore (Computer Vision Toolbox). You can also use a custom mini-batch datastore with background dispatch enabled. For more information on creating custom mini-batch datastores, see Develop Custom Mini-Batch Datastore.

For more information about datastore requirements for background dispatching, see Use Datastore for Parallel Training and Background Dispatching.

Work with Big Data in the Cloud

Storing data in the cloud can make it easier to access from cloud applications, because you do not need to upload or download large amounts of data each time you create cloud resources. Both AWS® and Azure® offer data storage services, such as AWS S3 and Azure Blob Storage, respectively.

To avoid the time and cost associated with transferring large quantities of data, it is recommended that you set up cloud resources for your deep learning applications using the same cloud provider and region that you use to store your data in the cloud.

To access data stored in the cloud from MATLAB, you must configure your machine with your access credentials. You can configure access from inside MATLAB using environment variables. For more information on how to set environment variables to access cloud data from your client MATLAB, see Work with Remote Data. For more information on how to set environment variables on parallel workers in a remote cluster, see Set Environment Variables on Workers (Parallel Computing Toolbox).
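
For example, access to data in AWS S3 from the client MATLAB might be configured as in the following sketch. The credential values, region, bucket name, and path are all placeholders. Check Work with Remote Data for the environment variable names that apply to your release and storage service.

```matlab
% Placeholder credentials: replace with your own values.
% Avoid saving real credentials in script files.
setenv("AWS_ACCESS_KEY_ID","YOUR_AWS_ACCESS_KEY_ID");
setenv("AWS_SECRET_ACCESS_KEY","YOUR_AWS_SECRET_ACCESS_KEY");
setenv("AWS_DEFAULT_REGION","us-east-1");   % assumed region

% Hypothetical bucket and folder. The datastore reads files directly
% from the remote location as they are needed.
imds = imageDatastore("s3://my-bucket/path/to/images", ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
```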

For an example showing how to upload data to the cloud, see Upload Deep Learning Data to the Cloud.

For more information about deep learning in the cloud, see Deep Learning in the Cloud.

Preprocess Data for Custom Training Loops

When you train a network using a custom training loop, you can process your data in the background by using minibatchqueue and enabling background dispatch. A minibatchqueue object iterates over a datastore to prepare mini-batches for custom training loops. Enable background dispatch when your mini-batches require heavy preprocessing.

To enable background dispatch, you must:

  • Set the DispatchInBackground property of the datastore to true.

  • Set the DispatchInBackground property of the minibatchqueue to true.

When you use this option, MATLAB opens a local parallel pool to use for preprocessing your data. Background preprocessing for custom training loops is supported only when training using local resources. For example, use this option when training using a single GPU in your local machine.
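
The two settings listed above can be sketched as follows for a custom training loop. The data folder, image size, and mini-batch size are assumptions, and preprocessMiniBatch is a hypothetical helper defined at the end of the script. If the parallel workers cannot resolve the local function, you might need to save it in its own file on the MATLAB path.

```matlab
% Placeholder location: a folder of RGB images with one subfolder per class.
dataFolder = fullfile(tempdir,"myImages");
imds = imageDatastore(dataFolder, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
augimds = augmentedImageDatastore([224 224],imds);

% Enable background dispatch on the datastore.
augimds.DispatchInBackground = true;

% Enable background dispatch on the minibatchqueue.
mbq = minibatchqueue(augimds, ...
    'MiniBatchSize',64, ...
    'MiniBatchFcn',@preprocessMiniBatch, ...
    'MiniBatchFormat',{'SSCB',''}, ...
    'DispatchInBackground',true);

while hasdata(mbq)
    [X,T] = next(mbq);   % X: formatted image batch, T: one-hot encoded labels
    % Evaluate the model loss and update the network here.
end

% Hypothetical helper: concatenate images along the batch dimension
% and one-hot encode the categorical labels.
function [X,T] = preprocessMiniBatch(dataX,dataT)
X = cat(4,dataX{:});
T = cat(2,dataT{:});
T = onehotencode(T,1);
end
```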

For more information about datastore requirements for background dispatching, see Use Datastore for Parallel Training and Background Dispatching.
