Optimize C/C++ Code Performance for Deep Learning Applications without Deep Learning Libraries
Vectorization and multithreading can improve the performance of embedded applications. You can use vectorization to execute the same instruction on multiple data elements simultaneously or multithreading to divide a workload into threads for concurrent execution across several cores. Both techniques allow processors to make more efficient use of available resources and complete tasks faster.
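For example, a simple function like the following sketch illustrates both techniques: element-wise arithmetic compiles to loops that SIMD instructions can accelerate, and a parfor loop maps to OpenMP threads in the generated code. The function name and operation are illustrative, not part of a documented workflow.
function y = scaleAndShift(x, a, b) %#codegen
% Element-wise arithmetic: with an instruction set selected, MATLAB
% Coder can vectorize the generated loop with SIMD intrinsics.
y = zeros(size(x), 'like', x);
% parfor maps to OpenMP threads in generated code when EnableOpenMP is on.
parfor i = 1:numel(x)
    y(i) = a * x(i) + b;
end
end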
For deep learning networks that are resilient to precision loss, compressing learnables from single precision to the bfloat16 data type greatly reduces memory usage with little change in inference accuracy. This process does not require calibration data and increases inference speed. Any hardware that supports the single-precision floating-point data type can benefit from bfloat16 compression.
Optimize the performance of generated C/C++ code by:
Enabling Single Instruction, Multiple Data (SIMD) intrinsics. See Generate SIMD Code from MATLAB Functions for Intel Platforms and Generate SIMD Code from MATLAB Functions for ARM Platforms (Embedded Coder).
Enabling OpenMP for processors that support multithreading. See EnableOpenMP.
Enabling bfloat16 compression of network learnables. See Generate bfloat16 Code for Deep Learning Networks.
Code Generation for ARM Cortex Targets
With MATLAB Coder, you can enable vectorization through the use of SIMD intrinsics available in the code replacement libraries for ARM Cortex-M targets or by setting the InstructionSetExtensions property for ARM® Cortex-A targets.
To generate SIMD C code for an ARM Cortex-A processor and deploy the code to Raspberry Pi® hardware, use these commands to set the InstructionSetExtensions property:
cfg = coder.config('lib');
cfg.Hardware = coder.Hardware('Raspberry Pi');
cfg.InstructionSetExtensions = "Neon v7";
cfg.EnableOpenMP = true;
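After you configure cfg, generate code with the codegen command. This usage sketch assumes a hypothetical entry-point function myPredict that takes a single-precision image input; substitute your own function and input size.
% myPredict and the input size are placeholders; replace with your own.
codegen -config cfg myPredict -args {ones(224,224,3,'single')} -report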
To generate code for a generic ARM Cortex-A target, use these commands to set the InstructionSetExtensions property:
cfg = coder.config('lib');
cfg.HardwareImplementation.ProdHWDeviceType = 'ARM Compatible->ARM Cortex-A';
cfg.InstructionSetExtensions = "Neon v7";
cfg.EnableOpenMP = true;
cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
To generate code for a generic ARM Cortex-M target, use these commands:
cfg = coder.config('lib');
cfg.HardwareImplementation.ProdHWDeviceType = 'ARM Compatible->ARM Cortex-M';
cfg.CodeReplacementLibrary = 'ARM Cortex-M';
cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
Code Generation for Intel and AMD
To generate code for x86 hardware, set the InstructionSetExtensions property to an instruction set extension that your processor supports. If you use Embedded Coder, you can also select from the instruction sets SSE, SSE4.1, AVX, AVX2, FMA, and AVX512F.
On an Intel® CPU, enable SIMD by setting the InstructionSetExtensions property to an instruction set extension that your processor supports. This example uses AVX512F on Linux.
cfg = coder.config('lib');
cfg.HardwareImplementation.ProdHWDeviceType = 'Intel->x86-64 (Linux 64)';
cfg.InstructionSetExtensions = 'AVX512F';
cfg.EnableOpenMP = true;
cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
On an AMD CPU, enable SIMD by setting the InstructionSetExtensions property to an instruction set extension that your processor supports. This example uses AVX512F on Linux.
cfg = coder.config('lib');
cfg.HardwareImplementation.ProdHWDeviceType = 'AMD->x86-64 (Linux 64)';
cfg.InstructionSetExtensions = 'AVX512F';
cfg.EnableOpenMP = true;
cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
Generate bfloat16 Code
Generate code with learnables compression in bfloat16 format by setting the LearnablesCompression property of your coder.DeepLearningCodeConfig object dlcfg:
dlcfg = coder.DeepLearningConfig(TargetLibrary = 'none');
dlcfg.LearnablesCompression = 'bfloat16';
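To apply the compression during code generation, attach dlcfg to the DeepLearningConfig property of your code configuration object. This is a minimal sketch; the entry-point function myPredict and its input size are hypothetical placeholders.
cfg = coder.config('lib');
cfg.DeepLearningConfig = dlcfg;  % use the bfloat16 compression settings
% myPredict and the input size are placeholders; replace with your own.
codegen -config cfg myPredict -args {ones(224,224,3,'single')}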
Alternatively, in the MATLAB® Coder™ app or the Configuration Parameters dialog box, on the Deep Learning tab, set Target library to none. Then set the Learnables Compression property to bfloat16.