A Productive Journey to Deploy Tiny Neural Networks on Microcontrollers
By Danilo Pau, STMicroelectronics, and Brenda Zhuang, MathWorks
Machine learning and deep learning applications are increasingly being moved from the cloud to embedded devices close to where the data originates. With the edge computing market expanding quickly, several factors are driving growth in Edge AI, including scalability, increasing demand for real-time AI applications, and the availability of low-cost edge devices complemented by robust and productive software toolchains. Additionally, there is a need to avoid transmitting data over a network, whether for security reasons or simply to minimize communication costs.
Edge AI encompasses a wide range of devices: sensors, microcontrollers, multi-microprocessor systems on a chip, application processors, and dedicated systems on chip, including relatively powerful edge servers and IoT modules. The reference community, the tinyML Foundation, established in 2019, focuses on developing machine learning models and deploying them on extremely resource-constrained embedded devices with limited memory, processing power, and energy budgets. tinyML opens unique opportunities, including applications that can be powered by inexpensive batteries or even small solar panels, as well as large-scale applications that process data locally on low-cost hardware. Of course, tinyML also presents challenges. One such challenge is that machine learning and embedded systems developers must optimize an application's performance and footprint together, which requires proficiency in both AI and embedded systems.
In such a context, this article describes a practical framework for designing and deploying deep neural networks on edge devices. Based on MATLAB® and Simulink® products, along with STMicroelectronics® Edge AI tools, the framework helps teams quickly grow expertise in deep learning and edge deployment, enabling them to overcome common hurdles encountered with tinyML. This, in turn, empowers them to rapidly build and benchmark proof-of-concept tinyML applications. In the first steps of the workflow, teams use MATLAB to build a deep learning network, tune hyperparameters with Bayesian optimization, use knowledge distillation, and compress the network with pruning and quantization. In the final step, the developers use the ST Edge AI Core Technology integrated into the ST Edge AI Developer Cloud—a free online service for developing AI on STMicroelectronics 32-bit (STM32, Stellar) microcontrollers and microprocessors, including sensors equipped with integrated AI—to benchmark resource utilization and inference speed for the deployed deep learning network (Figure 1).
Network Design, Training, and Hyperparameter Optimization
Once engineers have gathered, preprocessed, and prepared the data set for the deep learning application, the next step is to train and evaluate candidate models. These can include pretrained models, such as NASNet, SqueezeNet, Inception-v3, and ResNet-101, or models the machine learning engineer builds interactively using the Deep Network Designer app (Figure 2). Several example models are also provided to jumpstart development, including models for image, video, sound, and lidar point cloud classification; object detection; pose estimation; and waveform segmentation.
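As a minimal sketch, a pretrained model can be loaded and inspected as a starting point; the choice of SqueezeNet here is purely an illustrative assumption.

net = squeezenet;           % load a pretrained convolutional network
analyzeNetwork(net)         % inspect layers, learnables, and activation sizes
deepNetworkDesigner(net)    % open the network in the Deep Network Designer app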
The performance of a deep learning network is heavily dependent on both the parameters that govern its training and those that describe its network architecture. Examples of these hyperparameters include learning rate and batch size, as well as the number of layers, the type of layers, and the connections between layers. Proper hyperparameter tuning can lead to models that achieve higher accuracy and better performance, even in the resource-constrained environments in which tinyML applications run. However, selecting and fine-tuning hyperparameter values to find the combination that optimizes performance can be a difficult and time-consuming task.
Bayesian optimization is well suited for hyperparameter optimization of both classification and regression deep learning networks because it efficiently explores the high-dimensional hyperparameter space to find optimal or near-optimal configurations. In MATLAB, the machine learning developer can use the bayesopt function to find the best hyperparameter values with Bayesian optimization. For example, the developer can provide a set of hyperparameters to evaluate, such as the number of convolutional layers, the initial learning rate, momentum, and L2 regularization, along with an objective function to minimize, such as the validation error. The developer can then use the results from bayesopt to select one or more hyperparameter configurations to explore further in the next phases of the workflow.
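A minimal sketch of such a search is shown below; the variable names, ranges, and the makeObjFcn helper (assumed to train a candidate network and return its validation error) are illustrative rather than prescriptive.

% Hyperparameters to search over (ranges are illustrative assumptions).
optimVars = [
    optimizableVariable('SectionDepth',[1 3],'Type','integer')
    optimizableVariable('InitialLearnRate',[1e-3 1e-1],'Transform','log')
    optimizableVariable('Momentum',[0.8 0.98])
    optimizableVariable('L2Regularization',[1e-10 1e-2],'Transform','log')];

% makeObjFcn is a hypothetical helper that trains a network with the given
% hyperparameters and returns the validation error to be minimized.
objFcn = makeObjFcn(XTrain,YTrain,XValidation,YValidation);

results = bayesopt(objFcn,optimVars, ...
    'MaxObjectiveEvaluations',30, ...
    'IsObjectiveDeterministic',false);
bestHyperparameters = bestPoint(results);   % configuration(s) to carry forward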
Knowledge Distillation
Resource-constrained embedded devices have limited memory available. Knowledge distillation is one approach for reducing the footprint of a deep learning network while retaining a high level of accuracy. This technique uses a larger, more accurate teacher network to teach a smaller student network to make predictions. The key is the choice of the loss function that couples the teacher and student networks during training.
Networks trained in the earlier steps can be used as the teacher model. The student network is a smaller but similar version of the teacher, typically containing fewer convolution-batchnorm-ReLU blocks. To account for the dimension reduction, max pooling or global average pooling layers are added to the student network. These modifications significantly reduce the number of learnable parameters compared to the teacher network.
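A minimal sketch of such a student network is shown below; the input size, filter counts, and number of classes are illustrative assumptions, and the final fully connected layer outputs logits so that a softmax with temperature can be applied in the distillation loss.

layers = [
    imageInputLayer([32 32 3])
    convolution2dLayer(3,16,'Padding','same')   % fewer conv-batchnorm-ReLU blocks than the teacher
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    convolution2dLayer(3,32,'Padding','same')
    batchNormalizationLayer
    reluLayer
    globalAveragePooling2dLayer                 % collapses the spatial dimensions
    fullyConnectedLayer(10)];                   % logits for 10 assumed classes
studentNet = dlnetwork(layers);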
A knowledge distillation loss function must be defined to train the student network. It is computed from the student network, the teacher network, the input data with its corresponding targets, and a temperature hyperparameter. Empirically, the loss is a weighted average of 1) the hard loss, which is the cross-entropy between the student network outputs and the true labels; and 2) the soft loss, which is the cross-entropy between the temperature-scaled softmax of the student network logits and that of the teacher network logits.
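A minimal sketch of such a loss function, evaluated with dlfeval inside a custom training loop, is given below; the function name, the weighting factor lambda, and the assumption that both networks are dlnetwork objects outputting logits (no final softmax layer) are illustrative.

function [loss,gradients] = distillationLoss(studentNet,teacherNet,X,T,temperature,lambda)
    studentLogits = forward(studentNet,X);
    teacherLogits = predict(teacherNet,X);

    % Hard loss: cross-entropy between student predictions and true labels.
    hardLoss = crossentropy(softmax(studentLogits),T);

    % Soft loss: cross-entropy between the temperature-scaled softmax of the
    % student logits and that of the teacher logits.
    softLoss = crossentropy(softmax(studentLogits/temperature), ...
                            softmax(teacherLogits/temperature));

    % Weighted average of the hard and soft terms.
    loss = lambda*hardLoss + (1-lambda)*softLoss;
    gradients = dlgradient(loss,studentNet.Learnables);
end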
The trained student network preserves the accuracy of the teacher network better than a similarly small network trained from scratch, while achieving a substantial reduction in learnable parameters, making it more suitable for deployment to embedded devices.
Model Compression and Optimization
Efficient design and hyperparameter optimization during the training phase are an essential first step; however, they are not enough to guarantee that the network can be deployed on edge devices. Post-training optimization via model pruning and quantization is therefore important for further reducing the memory footprint and computational requirements of a deep neural network.
One of the most effective methods of network compression is quantization, a natural fit because data at the edge is acquired at integer precision; high-volume sensors do not output floating-point representations. With quantization, the goal is to reduce the memory footprint required to store the network's parameters and to increase computation speed by representing the model's weights and activations with a reduced number of bits. This may involve, for example, replacing 32-bit floating-point numbers with 8-bit integers when it is possible to do so while accepting only a marginal degradation in prediction accuracy. Quantization allows parsimonious use of the embedded memory, which is vital for resource-constrained sensors, microcontrollers, and microprocessors at the edge (Figure 3). Additionally, integer operations are generally faster in hardware than floating-point operations, improving inference performance on microcontrollers. Quantized models also consume less power, making them even more suitable for deployment on battery-powered or energy-constrained devices such as mobile phones and IoT devices. While post-training quantization can introduce some loss of precision, the quantization tools in MATLAB are designed to minimize the impact on model accuracy, using techniques such as fine-tuning and calibration to maintain the performance of the quantized model. In MATLAB, the dlquantizer function simplifies the process of quantizing the weights, biases, and activations of a deep neural network to 8-bit integer values.
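As a minimal sketch, assuming calibration and validation datastores (calDS and valDS) have already been prepared, the quantization workflow might look like this:

% Requires the Deep Learning Toolbox Model Quantization Library support package.
quantObj = dlquantizer(studentNet,'ExecutionEnvironment','MATLAB');
calResults = calibrate(quantObj,calDS);    % collect dynamic ranges of weights and activations
quantizedNet = quantize(quantObj);         % produce the int8-quantized network
valResults = validate(quantObj,valDS);     % check accuracy and memory savings of the quantized model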
Pruning techniques, in contrast, focus on reducing the complexity of a network by minimizing operational redundancy, which can dramatically reduce computational cost. The idea is to identify and remove the connections, weights, filters, or even entire layers that have little effect on the network's predictions. Projection is a proprietary MATLAB technique used to optimize neural networks by selectively removing less important weights or connections. This process reduces the model's complexity, resulting in a smaller model size and faster inference times without significantly compromising performance. While regular pruning generally involves straightforward threshold-based removal of low-magnitude weights, projection may incorporate more sophisticated criteria and methods to ensure that the network's essential features are preserved. Additionally, projection often aims to maintain the geometric properties of the weight space, leading to potentially more efficient and robust models compared to traditional pruning methods.
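As a minimal sketch, projection-based compression can be applied with the compressNetworkUsingProjection function; the calibration minibatchqueue (mbqCalibration) and the explained variance goal are illustrative assumptions, and the projected network can be fine-tuned afterward to recover any lost accuracy.

% Analyze neuron activations on calibration data and project the learnables
% onto a lower-dimensional subspace that preserves most of their variance.
netProjected = compressNetworkUsingProjection(studentNet,mbqCalibration, ...
    'ExplainedVarianceGoal',0.95);

% Optionally replace the projected layers with equivalent, lighter-weight
% layers for efficient inference and code generation.
netUnpacked = unpackProjectedLayers(netProjected);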
Benchmarking on ST Edge AI Developer Cloud
After completing the initial network design, hyperparameter optimization, distillation, and compression in MATLAB, the next step in the workflow is to assess the performance of that design on a microcontroller or microprocessor. Specifically, engineers need to evaluate the flash and RAM requirements of the network as well as inference speed, among other factors.
ST Edge AI Developer Cloud is designed to streamline this stage of the workflow by enabling rapid benchmarking of networks on ST Edge devices. To use this service for a tinyML application developed in MATLAB, engineers first export the network to ONNX format. After uploading the generated ONNX file to ST Edge AI Developer Cloud, they can select the ST device or devices on which to run the benchmarks (Figure 4).
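A minimal sketch of this export step, with an assumed network variable and output file name, is:

% Export the compressed network to ONNX for upload to ST Edge AI Developer Cloud;
% requires the Deep Learning Toolbox Converter for ONNX Model Format support package.
exportONNXNetwork(netProjected,"tinyml_model.onnx");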
Once the benchmarking is complete, ST Edge AI Developer Cloud provides a report detailing the results (Figure 5). The performance analysis tools provided by the ST Edge AI Developer Cloud offer a variety of detailed insights, including memory usage, processing speed, resource utilization, and model accuracy. Developers receive information on RAM and flash memory consumption, with a breakdown of memory allocation for different layers and components of the model. Additionally, the tools provide execution time for each layer and overall inference time, along with detailed timing analysis to identify and optimize slow operations. Resource utilization statistics, including CPU and hardware accelerator usage, as well as power consumption metrics, help in optimizing energy efficiency.
Following a review of the benchmark results, engineers can identify the best course of action for the next steps. If the network design fits comfortably within the constraints of a given edge device and delivers low inference times, they might explore opportunities either to use an even smaller device or to increase prediction accuracy with a somewhat larger, more complex network. On the other hand, if the network design is too big, resulting in slow inference times due to the use of external flash or RAM, the team might look for a more computationally powerful device with more embedded flash and RAM, or they might perform additional hyperparameter optimization, knowledge distillation, pruning, and quantization iterations in MATLAB to compress the network further. The ST Edge AI Developer Cloud also offers automated code generation to streamline the deployment of AI models on ST devices. This feature converts trained AI models into optimized C code compatible with STMicroelectronics' sensors, microcontrollers, and microprocessors.
From Benchmarking to Deployment
The final step in the workflow is deployment to a sensor, microcontroller, or microprocessor. With benchmarking results in hand, engineers can make an informed decision when selecting a platform, such as an STM32 Discovery Kit, on which to evaluate their tinyML application on real hardware.
Depending on the application, they may need to integrate the deep neural network with other components, such as a controller, and incorporate them into a larger system before deployment. For these use cases, they can extend the workflow further, modeling the other components in Simulink, running system-level simulations to validate the design, and generating C/C++ code for deployment to STM32 devices using Embedded Coder® and Embedded Coder support for STM32.
Published 2024