Optimize C/C++ Code Performance for Deep Learning Applications without Deep Learning Libraries
Vectorization and multithreading can improve the performance of embedded applications. You can use vectorization to execute the same instruction on multiple data elements simultaneously or multithreading to divide a workload into threads for concurrent execution across several cores. Both techniques allow processors to make more efficient use of available resources and complete tasks faster.
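For example, a simple function like the following sketch illustrates both techniques: element-wise arithmetic compiles to loops that SIMD instructions can accelerate, and a parfor loop maps to OpenMP threads in the generated code. The function name and operation are illustrative, not part of a documented workflow.
function y = scaleAndShift(x, a, b) %#codegen
% Element-wise arithmetic: with an instruction set selected, MATLAB
% Coder can vectorize the generated loop with SIMD intrinsics.
y = zeros(size(x), 'like', x);
% parfor maps to OpenMP threads in generated code when EnableOpenMP is on.
parfor i = 1:numel(x)
    y(i) = a * x(i) + b;
end
end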
For deep learning networks that are resilient to precision loss, compressing learnables from single precision to the bfloat16 data type greatly reduces memory usage with little change in inference accuracy. This process does not require calibration data and increases inference speed. Any hardware that supports the single-precision floating-point data type can benefit from bfloat16 compression.
Optimize the performance of generated C/C++ code by:
Enabling Single Instruction, Multiple Data (SIMD) intrinsics. See Generate SIMD Code from MATLAB Functions for Intel Platforms and Generate SIMD Code from MATLAB Functions for ARM Platforms (Embedded Coder).
Enabling OpenMP for processors that support multithreading. See EnableOpenMP.
Enabling bfloat16 compression of network learnables. See Generate bfloat16 Code for Deep Learning Networks.
Code Generation for ARM Cortex Targets
With MATLAB Coder, you can enable vectorization through the use of SIMD intrinsics available in the code replacement libraries for ARM Cortex-M targets or by setting the InstructionSetExtensions property for ARM® Cortex-A targets.
To generate SIMD C code for an ARM Cortex-A processor and deploy the code to Raspberry Pi® hardware, use these commands to set the InstructionSetExtensions property:
cfg = coder.config('lib');
cfg.Hardware = coder.Hardware('Raspberry Pi');
cfg.InstructionSetExtensions = "Neon v7";
cfg.EnableOpenMP = true;
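After you configure cfg, generate code with the codegen command. This usage sketch assumes a hypothetical entry-point function myPredict that takes a single-precision image input; substitute your own function and input size.
% myPredict and the input size are placeholders; replace with your own.
codegen -config cfg myPredict -args {ones(224,224,3,'single')} -report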
To generate code for a generic ARM Cortex-A target, use these commands to set the InstructionSetExtensions property:
cfg = coder.config('lib');
cfg.HardwareImplementation.ProdHWDeviceType = 'ARM Compatible->ARM Cortex-A';
cfg.InstructionSetExtensions = "Neon v7";
cfg.EnableOpenMP = true;
cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
To generate code for a generic ARM Cortex-M target, use these commands:
cfg = coder.config('lib');
cfg.HardwareImplementation.ProdHWDeviceType = 'ARM Compatible->ARM Cortex-M';
cfg.CodeReplacementLibrary = 'ARM Cortex-M';
cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
Code Generation for Intel and AMD
To generate code for x86 hardware, set the InstructionSetExtensions property to an instruction set extension that your processor supports. If you use Embedded Coder, you can also select from the instruction sets SSE, SSE4.1, AVX, AVX2, FMA, and AVX512F.
On an Intel® CPU, enable SIMD by setting the InstructionSetExtensions property to an instruction set extension that your processor supports. This example uses AVX512F on Linux.
cfg = coder.config('lib');
cfg.HardwareImplementation.ProdHWDeviceType = 'Intel->x86-64 (Linux 64)';
cfg.InstructionSetExtensions = 'AVX512F';
cfg.EnableOpenMP = true;
cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
On an AMD CPU, enable SIMD by setting the InstructionSetExtensions property to an instruction set extension that your processor supports. This example uses AVX512F on Linux.
cfg = coder.config('lib');
cfg.HardwareImplementation.ProdHWDeviceType = 'AMD->x86-64 (Linux 64)';
cfg.InstructionSetExtensions = 'AVX512F';
cfg.EnableOpenMP = true;
cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
Generate bfloat16 Code
Generate code with learnables compression in bfloat16 format by setting the LearnablesCompression property of your coder.DeepLearningCodeConfig object dlcfg:
dlcfg = coder.DeepLearningConfig(TargetLibrary = 'none');
dlcfg.LearnablesCompression = 'bfloat16';
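To apply the compression during code generation, attach dlcfg to the DeepLearningConfig property of your code configuration object. This is a minimal sketch; the entry-point function myPredict and its input size are hypothetical placeholders.
cfg = coder.config('lib');
cfg.DeepLearningConfig = dlcfg;  % use the bfloat16 compression settings
% myPredict and the input size are placeholders; replace with your own.
codegen -config cfg myPredict -args {ones(224,224,3,'single')}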
Alternatively, in the MATLAB® Coder™ app or the Configuration Parameters dialog box, on the Deep Learning tab, set Target library to none. Then set the Learnables Compression property to bfloat16.