How can I optimize the performance of library-free C/C++ code generated from deep learning networks?

Question

Jack Ferrari on 23 Mar 2023

Edited: Jack Ferrari on 23 May 2024

I am generating code for a deep learning network with coder.DeepLearningConfig(TargetLibrary = 'none'). Which code generation configuration settings should I use to optimize the performance of the generated code?

Answer 1

Jack Ferrari on 23 Mar 2023

Edited: Jack Ferrari on 23 May 2024

Vectorization and multi-threading are techniques that can improve the performance of embedded applications. Both allow processors to make more efficient use of available resources and complete tasks faster, either by executing the same instruction on multiple data elements simultaneously (vectorization), or by dividing a workload into threads for concurrent execution across several cores (multi-threading).

With MATLAB Coder, you can take advantage of vectorization through SIMD (Single Instruction, Multiple Data) intrinsics, which are available in the code replacement libraries for ARM Cortex-A and Cortex-M targets. On Intel and AMD CPUs, enable SIMD by selecting the AVX2 or AVX512 instruction set extensions. For processors with multiple cores, enable OpenMP to generate multi-threaded code.

Additionally, as of R2023a, you can enable bfloat16 compression of network learnables. For deep learning networks that are resilient to precision loss, compressing learnables from single precision (32 bits) to bfloat16 (16 bits) halves the memory needed to store them, with little change in inference accuracy. This process does not require calibration data and can also increase inference speed. Any hardware that supports single-precision floating-point datatypes can benefit from bfloat16. For more information, refer to the MATLAB Coder documentation on bfloat16 compression.
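To get a feel for the precision loss involved, here is a small sketch that emulates bfloat16 storage in plain MATLAB by keeping only the upper 16 bits of each single-precision value. This is an illustration only: it assumes simple truncation, the sample values are made up, and the conversion that the code generator actually performs may round differently.

```matlab
% Sketch: emulate bfloat16 storage by keeping the upper 16 bits of each single.
% Assumption: plain truncation; the real conversion may round to nearest instead.
w = single([pi, -0.1, 1e-3, 123.456]);          % hypothetical learnable values
bits = typecast(w, 'uint32');                   % reinterpret the 32-bit patterns
mask = uint32(hex2dec('FFFF0000'));             % sign, 8 exponent bits, top 7 mantissa bits
wBf16 = typecast(bitand(bits, mask), 'single'); % zero the low 16 mantissa bits

relErr = abs(wBf16 - w) ./ abs(w);              % per-element relative error
fprintf('max relative error: %.3g (truncation bound 2^-7 = %.3g)\n', max(relErr), 2^-7);
fprintf('storage: %d bytes as single, %d bytes as bfloat16\n', 4*numel(w), 2*numel(w));
```

bfloat16 keeps the full 8-bit exponent of single precision, so the dynamic range is unchanged and only the mantissa is shortened. That is why networks that tolerate small relative errors in their weights tend to be good candidates.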

Note: These settings are general guidelines. Depending on your specific application and hardware target, changing additional configuration settings may further improve performance.

Using MATLAB Coder

See the following page: Optimize C/C++ Code Performance for Deep Learning Applications without Deep Learning Libraries
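As a rough starting point, the settings discussed above might be combined like this for an Intel or AMD host. Treat it as a sketch rather than a definitive recipe: myPredict and the 224x224x3 input size are hypothetical placeholders, and the property names reflect my understanding of recent releases, so check them against the documentation for your release.

```matlab
% Sketch: library-free deep learning code generation with MATLAB Coder.
% myPredict (entry-point function) and the input size are hypothetical.
cfg = coder.config('lib');                                  % static library target
cfg.TargetLang = 'C';
cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'none');
cfg.DeepLearningConfig.LearnablesCompression = 'bfloat16';  % R2023a or later; use 'none' to keep single precision
cfg.EnableOpenMP = true;                                    % multi-threaded loops
cfg.HardwareImplementation.ProdHWDeviceType = 'Intel->x86-64 (Linux 64)'; % match your platform
cfg.InstructionSetExtensions = 'AVX2';                      % or 'AVX512F' if the CPU supports it

codegen -config cfg myPredict -args {ones(224,224,3,'single')} -report
```

For an ARM Cortex-A target, the analogous change would be to set ProdHWDeviceType to 'ARM Compatible->ARM Cortex-A' and CodeReplacementLibrary to 'GCC ARM Cortex-A' instead of the instruction set extension, which requires Embedded Coder.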

Using GPU Coder

NVIDIA Jetson Board

>> cfg = coder.gpuConfig('dll');
>> cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'tensorrt');
>> cfg.DeepLearningConfig.DataType = 'FP16'; % Requires GPU compute capability (CC) greater than 5.3, except 6.1
>> cfg.Hardware = coder.Hardware('NVIDIA Jetson');
>> cfg.GpuConfig.EnableMemoryManager = true;
>> cfg.GpuConfig.ComputeCapability = '7.0'; % modify according to actual hardware

NVIDIA Desktop GPU

>> cfg = coder.gpuConfig('dll');
>> cfg.DeepLearningConfig = coder.DeepLearningConfig(TargetLibrary = 'tensorrt');
>> cfg.DeepLearningConfig.DataType = 'FP16'; % Requires GPU compute capability (CC) greater than 5.3, except 6.1
>> cfg.GpuConfig.EnableMemoryManager = true;
>> cfg.GpuConfig.ComputeCapability = '7.0'; % modify according to actual hardware

Using Embedded Coder with Simulink

Select the Simulink model and run the following commands with set_param:

ARM Cortex-A Targets

>> set_param(gcs, 'ProdHWDeviceType', 'ARM Compatible->ARM Cortex-A');
>> set_param(gcs, 'CodeReplacementLibrary', 'GCC ARM Cortex-A');
>> set_param(gcs, 'MultiThreadedLoops', 'on');
>> set_param(gcs, 'MaxStackSize', '20000');
>> set_param(gcs, 'DLTargetLibrary', 'none');
>> set_param(gcs, 'DLLearnablesCompression', 'bfloat16'); % use bfloat16 for best performance; set to 'none' if you intend to compare results from single-precision models

Intel Targets

>> % Set ProdHWDeviceType once, using only the line that matches your platform:
>> set_param(gcs, 'ProdHWDeviceType', 'Intel->x86-64 (Windows64)'); % Windows
>> set_param(gcs, 'ProdHWDeviceType', 'Intel->x86-64 (Linux 64)'); % Linux
>> set_param(gcs, 'ProdHWDeviceType', 'Intel->x86-64 (Mac OS X)'); % macOS
>> set_param(gcs, 'InstructionSetExtensions', 'AVX512F'); % or 'AVX2' if 'AVX512F' is not available
>> set_param(gcs, 'MultiThreadedLoops', 'on');
>> set_param(gcs, 'MaxStackSize', '20000');
>> set_param(gcs, 'DLTargetLibrary', 'none');
>> set_param(gcs, 'DLLearnablesCompression', 'bfloat16'); % use bfloat16 for best performance; set to 'none' if you intend to compare results from single-precision models
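
The three ProdHWDeviceType lines in the Intel settings are alternatives, so if you run all of them the last one wins. If the generated code will run on the same machine you are building on, one option is to pick the value programmatically. This is a sketch under that assumption; for cross-compilation, set the device type by hand instead.

```matlab
% Sketch: choose the Intel device type from the host platform.
% Assumption: the host machine is also the deployment target.
if ispc
    deviceType = 'Intel->x86-64 (Windows64)';
elseif ismac                      % check before isunix, which is also true on macOS
    deviceType = 'Intel->x86-64 (Mac OS X)';
else
    deviceType = 'Intel->x86-64 (Linux 64)';
end
set_param(gcs, 'ProdHWDeviceType', deviceType);
```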
