Generate bfloat16 Code for Deep Learning Networks
Deep learning networks typically use the single-precision floating-point data type to store information such as inputs, weights, and activations. Each element stored in single-precision format occupies 32 bits of memory, so the memory footprint required to store a deep learning network can be very large. The Brain Floating Point Format (bfloat16) is a truncated version of the single-precision floating-point format that occupies only 16 bits of memory. bfloat16 preserves approximately the same number range as single precision because it retains the same number of exponent bits (8 bits). However, bfloat16 has only 7 fraction bits, so it represents values with reduced precision.
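To make the truncation concrete, this minimal sketch (not part of the code generation workflow) emulates bfloat16 rounding by zeroing the 16 least significant bits of a single-precision value:
x = single(pi);
bits = typecast(x,'uint32');                            % raw IEEE 754 bit pattern of the single value
bfloatBits = bitand(bits,uint32(hex2dec('FFFF0000')));  % keep the sign bit, 8 exponent bits, and 7 fraction bits
xTruncated = typecast(bfloatBits,'single');
fprintf('single: %.7f   bfloat16: %.7f\n',x,xTruncated)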
For deep learning models that are resilient to precision loss, compressing learnables from single precision to bfloat16 greatly reduces memory usage with little change in accuracy. Because each learnable shrinks from 32 bits to 16 bits, the memory required to store the learnables is roughly halved. The process requires no data or preprocessing step and also improves inference speed. This enables deployment of large deep learning networks to devices with low computational power and limited memory. Any hardware that supports the single-precision floating-point data type can use bfloat16 compression; the processor does not need native bfloat16 support. For example, you can use bfloat16 learnables compression on any ARM-M, ARM-A, or Intel processor.
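As a rough illustration of the savings (the learnable count below is hypothetical, and generated code adds some overhead):
numLearnables = 25e6;                  % hypothetical network with 25 million learnables
memSingleMB = numLearnables*4/1e6;     % 4 bytes per single-precision element
memBfloat16MB = numLearnables*2/1e6;   % 2 bytes per bfloat16 element
fprintf('single: %.0f MB   bfloat16: %.0f MB\n',memSingleMB,memBfloat16MB)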
Learnables compression in bfloat16 format is supported only for generating generic C/C++ code (code that does not depend on third-party libraries).
Supported Layers and Classes
You can perform learnables compression in bfloat16 format and generate generic C/C++ code for these layers (a sketch of a network that uses some of these layers appears after the list):
- Bidirectional LSTM layer (bilstmLayer (Deep Learning Toolbox))
- Fully connected layer (fullyConnectedLayer (Deep Learning Toolbox))
- Channel-wise convolution layer (groupedConvolution2dLayer (Deep Learning Toolbox))
- Gated recurrent unit (GRU) layer (gruLayer (Deep Learning Toolbox))
- Gated recurrent unit (GRU) projected layer (gruProjectedLayer (Deep Learning Toolbox))
- Long short-term memory (LSTM) layer (lstmLayer (Deep Learning Toolbox))
- LSTM projected layer (lstmProjectedLayer (Deep Learning Toolbox))
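For example, a minimal network whose learnables can all be compressed to bfloat16 might look like the following sketch. The layer sizes are arbitrary, and the input and softmax layers carry no learnables:
layers = [
    sequenceInputLayer(12)       % input layer, no learnables
    lstmLayer(100)               % learnables eligible for bfloat16 compression
    fullyConnectedLayer(9)       % learnables eligible for bfloat16 compression
    softmaxLayer];               % no learnables
net = dlnetwork(layers);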
Generate Code
Generate code with learnables compression in bfloat16 format by setting the LearnablesCompression property of your coder.DeepLearningCodeConfig object dlcfg:
dlcfg = coder.DeepLearningConfig(TargetLibrary = 'none');
dlcfg.LearnablesCompression = 'bfloat16';
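To show where this configuration fits, here is a hedged sketch of a complete code generation command. The entry-point function myPredict and the 12-by-50 input size are hypothetical placeholders:
cfg = coder.config('lib');          % code generation configuration for a static library
cfg.DeepLearningConfig = dlcfg;     % attach the deep learning settings, including bfloat16 compression
codegen -config cfg myPredict -args {single(zeros(12,50))} -report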
Alternatively, in the MATLAB® Coder™ app or in the Configuration Parameters dialog box, on the Deep Learning tab, set Target library to none. Then set the Learnables Compression property to bfloat16.
Usage Notes and Limitations
convolution2dLayer (Deep Learning Toolbox) does not support bfloat16 compression. Its learnables are still stored in the single-precision data format and do not contribute to memory compression. However, when bfloat16 compression is enabled, the least significant 16 bits of the learnables of this layer are set to zero. This behavior causes the results of inference computations involving these learnables to mimic bfloat16 precision.
Related Topics
- Specify Configuration Parameters in Command-Line Workflow Interactively
- Code Generation for Sequence-to-Sequence Classification with Learnables Compression
- Optimize C/C++ Code Performance for Deep Learning Applications without Deep Learning Libraries