
    Machine Learning for the Edge – Model Compression and Deployment

    Overview

    AI is no longer limited to powerful computing environments such as GPUs or high-end CPUs; it is often integrated into systems with limited resources, such as patient monitoring, diagnostic systems in vehicles, and manufacturing equipment. Fitting AI onto hardware with limited memory and power requires deliberate trade-offs between model size, accuracy, inference speed, and power consumption, and that process is still challenging in many AI development frameworks.

    With MATLAB and Simulink you can leverage the well-established model-based design workflow to bring your AI models to any edge device, taking advantage of automated code generation, datatype optimization and more.

    Highlights

    We will show a typical workflow for deploying AI models to edge devices:

    • Model Selection: Identify less complex models and neural networks that still achieve the required accuracy
    • Size Reduction: Tune the hyperparameters and select appropriate datatypes to generate a more compact model
    • Hardware-in-the-loop Tests: Validate the correctness, memory requirements and runtime performance of your models directly on the hardware

    About the Presenters

    Christoph Stockhammer holds an M.Sc. degree in Mathematics from the Technical University of Munich with an emphasis on optimization. He joined MathWorks in 2012 and works as an application engineer. His focus areas include mathematics and data analytics, machine and deep learning, as well as the integration of MATLAB software components into other programming languages and environments.

    Christoph Kammer is an application engineer at MathWorks Switzerland. He supports customers in many different industries in the areas of machine and deep learning, image and signal processing and deployment to embedded or enterprise systems. Christoph has a master’s degree in Mechanical Engineering from ETHZ and a PhD in Electrical Engineering from EPFL, where he specialized in control design and the control and modelling of electromechanical systems and power systems.

    Recorded: 4 Oct 2022

    OK. So let's start with this webinar on machine learning for the edge, model compression and deployment. So today, AI, it can be said, is in a way in many industries. So we see AI models, big or small, being deployed now to many, many devices running on the edge. For example, these cool smartwatches, which capture a lot of metrics and do a lot of medical data analysis.

    Hearing aids are a big field where AI holds a lot of promise, wireless, predictive maintenance, and of course automotive. And, yeah, of course, now the question is: we have these nice AI models. We trained them on our computers. And we should now get them onto these small embedded devices.

    How can we go about this? Of course, the main problem is hardware constraints. So if we start out on a big cluster to train something, we have unlimited memory, essentially, and a lot of processing power, even on our local computers. My laptop has a GPU. It has a lot of RAM.

    But once we go to a real embedded processor, of course, then everything is very, very different. Memory is very constrained, so sizes of kilobytes, usually. The processor is pretty slow. And we simply cannot run huge AI models. But since AI models have a lot of parameters, they tend to be over-parameterized, and there is a lot of potential to reduce the size of these models. Of course, we need the right techniques to do that.

    So why is this hard, just to summarize this again, just to be very clear? So, usually, we have a data scientist. She trains some models. She maybe puts them on the cloud, trains them on the cloud, AWS or Azure. And that's not a problem. Now, the next step is to put it on an embedded device. And you just try. You put your model on the embedded device. And then the embedded software engineer shows up and tells you, well, the chip only has 500 kilobytes of memory. Your model is 5 megabytes. That doesn't work out. Make that smaller. So that's the main question: how can I make my AI model actually smaller? That's what we're going to answer today.

    Or to summarize it in one very nice picture, finally, why is embedded AI difficult? I think this really illustrates it well. All right, so I quickly want to go over the model compression workflow that we will talk about today, mostly for machine learning models. There are also techniques for neural networks, and we will have a few slides about that in the end. But the main part will focus on traditional machine learning models.

    So the first step in the workflow is to actually determine hardware constraints because that defines everything. This really gives us the constraints on how big the model can be. And, yeah, this is usually the most important step because every model you train is useless if it doesn't fit on the hardware in the end.

    Then, once you have your hardware constraints, you can start selecting some models that might fit your constraints. Then, once you train a model, you simplify it. And this is, of course, again, an iterative process. So you select the model. You simplify. You see if performance is good. You see if the constraints are satisfied. You go back, maybe you tune different hyperparameters, maybe you train with different data.

    Then you can quantize the model parameters. So you can go from double precision to single precision, or even int8. And then, finally, you deploy and integrate on your embedded device. So one, two, three, that's what we're going to do.

    Step one, models have different complexities and come in different sizes. So just very qualitatively speaking, you can say decision trees, for example, they're usually among the least accurate models, especially for more complex tasks. But they're also not very big, and they run fast.

    Linear models are similar. They're pretty small. And then when you go to shallow neural networks, or Kernel SVMs, or Gaussian Processes, then the accuracy tends to increase. But also the size and execution time tends to go up. And so we see, already, here, this can give us some very rough idea of which models might be suitable. And we also have a nice table in the end that gives us a good overview of what parameters those models have. And we will share that in the end. Of course, there's also deep neural networks. As I mentioned, there's special techniques for that. They are definitely far to the top right. And it's less the focus of today's presentation.

    Now, step number two, once we have picked a model, we want to try to minimize the number of model parameters to reduce the size and to also reduce the inference time. So there's a few knobs we can tune. We could use more or fewer input features. We can tune size-relevant hyperparameters. And of course, the idea is to maximize our accuracy given our size constraint.

    So there's a bunch of things we can do. We can do feature ranking to reduce the number of features, so we only keep the most relevant features. We can use hyperparameter optimization, yeah, to kind of adjust the size-relevant hyperparameters. And then the model is already a bit smaller. And finally, we can quantize our model.

    So once we have trained our model, we can use, for example, this tool called Fixed-Point Designer in MATLAB and Simulink, which can be used to quantize from double precision to fixed point. And of course, this, again, saves quite a lot of space. And then we end up with a pretty small model that hopefully fits on our embedded target and runs in the expected time.
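
    As a small illustration of that last step, here is a minimal sketch of how Fixed-Point Designer's fi objects can be used to quantize double-precision values to fixed point. The variable names and word lengths are illustrative assumptions, not the settings used in the webinar.

        % Minimal sketch: quantize double-precision values with a fi object.
        w = rand(50,1) - 0.5;                    % hypothetical model parameters
        wfx = fi(w, 1, 16, 14);                  % signed, 16-bit word, 14 fraction bits
        bytesDouble = numel(w) * 8;              % 8 bytes per double
        bytesFixed  = numel(wfx) * 2;            % 2 bytes per 16-bit value
        maxQuantError = max(abs(double(wfx) - w));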

    So, a bit more elaborate here, the model compression workflow for machine learning: determine hardware constraints, select an initial model, select features and tune hyperparameters (that's the Simplify Model step), then quantize, and deploy.

    To select an initial model, we're going to use the Classification Learner or Regression Learner, which are nice graphical interfaces in MATLAB. We will show you that in more detail. Then we're going to show feature selection, Bayesian hyperparameter optimization, and quantization. And finally, not on this slide, we're also going to show you how to actually deploy.

    The demo we're going to use is about embedded AI in a hearing aid. Now, if you are actually from a hearing aid company, we are aware that this is a very, very simplified version of this. So please don't hate. But I think for the purpose of this webinar, it's going to be very illustrative.

    So essentially, we have two scenes. We can have people sitting in a cafe. There's a lot of environment noise, a lot of voices around. And we want to detect that so we can switch to directional mode in our hearing aid. Or we have another scene, which is a forest path, and there we switch to all-around mode in our hearing aid. So essentially, we have some acoustic scenes. We want to classify those acoustic scenes so we can then do something based on where we are.

    Exactly. Thanks, Christoph. So before we go into the actual demo, I'd like to talk a bit about preprocessing, or rather, feature extraction techniques. So imagine you get your raw data from the hearing aid from a kind of a microphone, right? But you probably don't want to work on the raw data. Rather, what we want to do is employ some feature extraction techniques. And there could be a broad variety of those.

    We decided on a technique called Wavelet Scattering. Now here's an illustration of how that works. So to the left, we have the actual microphone. And it yields data in stereo format, so those two lines you can see, blue and red. And what wavelet scattering really means is we go through something like a bunch of filter operations, namely convolutional filters, like in a deep neural network or a convolutional neural network. And we have things like activation functions; a classic one is a ReLU layer.

    And now, we can apply this to the incoming signal, as we could also do in an image-based application. The difference here is that the weights and biases of the convolutional filters are not learned, but predefined. They are fixed. So we have a static network, if you want.

    And we just feed the signal through that. And this gives us a bunch of features we can extract from the raw data and feed into a classifier. Now, again, it's not the only way to get there. There are many possible techniques that you could use. It's just a thing that we tried out, and it worked pretty well.

    So you can probably start with only two layers in this network. But it depends on what requirements you have in terms of model accuracy, compute and resource restrictions, and so on. But in general, this works pretty well. And it has even featured in the leaderboards of different data science competitions.

    So a bit more detail on how that works: basically, this works on different layers. So we always have different operations, like a scaling operation, that we apply to the original signal, which gives us layer one. And then, subsequently, we also apply wavelet filters, so filters defined by wavelet functions.

    So this allows us to decompose our signal at different layers. All right, we can go to any depth, but typically, you stop at two or three layers because this already gives you enough information. And from the raw audio signal, this way, we can derive as many features as we want.

    Typically, as I said, you stop at 2 or 3 layers, and this gives you enough information already. And we can treat the result as a matrix of feature vectors. So this is a MATLAB function that we can employ to compute all features from the raw data.
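
    As a rough sketch of what this looks like in code: the waveletScattering object and featureMatrix function below come from Wavelet Toolbox, while the signal variable, sample rate, and invariance scale are illustrative assumptions rather than the exact settings of the demo.

        % Sketch: wavelet scattering features for one (mono) audio signal x.
        fs = 44100;                                        % assumed sample rate
        sn = waveletScattering('SignalLength', numel(x), ...
            'SamplingFrequency', fs, 'InvarianceScale', 0.5);
        feat = featureMatrix(sn, x);                       % scattering coefficients
        featVec = mean(feat, 2);                           % average over time windows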

    With that being said, just a quick intro to wavelet scattering, I want to switch to MATLAB to show you how this works in practice. We have built a live script, so an executable MATLAB notebook, that just goes through the complete workflow of training and comparing different machine learning models, and finally, also deploying them to embedded hardware.

    We can go through this chapter-wise. And initially, we start out with our original data set. And the original data set comprises 15 different scenes. Remember we want to boil it down to only 2 scenes, like directional and all-around. But in the original data set, we had those 15 different categories. As usual, we separate our data set into training and testing data sets, so a standard approach for any machine learning algorithm development. Now, those are already the features extracted by the wavelet scattering. So we're no longer working directly on the raw data, but rather on the extracted feature matrices.
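
    The split itself is the standard pattern; a minimal sketch could look like this, where the table and label names are assumptions for illustration, not the demo's actual variable names.

        % Sketch: stratified 80/20 hold-out split on the precomputed features.
        c = cvpartition(featureTbl.Scene, 'HoldOut', 0.2);
        trainTbl = featureTbl(training(c), :);
        testTbl  = featureTbl(test(c), :);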

    So, as I said, we could train our models interactively, as my colleague illustrated, in the Classification Learner app. You can see a screenshot here. But for now, I just want to load a bunch of models that we already trained and saved to disk. And we can have something like a comparison of those initial models and just run the next section. And we collect those results in a table for a clearer overview.

    And we have two things. We can look at the model sizes in kilobytes. And we can look at the accuracies, in how many classes were correctly identified. Remember we have 15 classes right now. And in total, we have eight different model types. We have a neural network with two layers. We have boosted trees. We have a random forest, so bagged trees, if you want. We have a K-Nearest Neighbor model. We have a linear discriminant model, logistic regression, a support vector machine with a linear kernel, and a support vector machine with a Gaussian kernel.

    And for all of those, we use the whos command as a proxy for the expected memory consumption. Now, it's important to state that this is not the exact size those models will take on any given hardware. But it's probably a good enough estimate for the size they're going to have. And we can already see that the sizes vary dramatically, ranging from 30 kilobytes for the logistic model up to more than 40 megabytes for the Gaussian SVM model. So there are, of course, dramatic differences in size.
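
    At the command line, a comparable check might look like the following sketch. The predictor matrix Xtrain and label vector Ytrain are assumed names, and the fit functions shown are just representative of the eight model types in the table.

        % Sketch: train a few of the model types and use whos as a size proxy.
        mdlNet = fitcnet(Xtrain, Ytrain, 'LayerSizes', [20 20]);   % 2-layer neural net
        mdlBag = fitcensemble(Xtrain, Ytrain, 'Method', 'Bag');    % random forest
        mdlKNN = fitcknn(Xtrain, Ytrain, 'NumNeighbors', 5);
        mdlSVM = fitcecoc(Xtrain, Ytrain, ...                      % multiclass Gaussian SVM
            'Learners', templateSVM('KernelFunction', 'gaussian'));
        info = whos('mdlNet', 'mdlBag', 'mdlKNN', 'mdlSVM');
        table({info.name}', [info.bytes]'/1024, 'VariableNames', {'Model', 'SizeKB'})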

    When we look at the accuracies, we see that almost all models can reach more than 80% accuracy. The leader is probably the neural network, and logistic regression and the Gaussian support vector machine have the least accuracy. So again, remember we have 15 classes here that we want to separate.

    Now, as my colleague already said, all that really interests us is a distinction between hearing scenarios: whether we want to listen to a specific counterpart, for example, in a cafe, or whether we want to really listen to sounds from all directions. In the former case, we probably want to apply some beamforming techniques, like listening in a specific direction. Whereas in the other case, obviously, we want to listen to sounds that come from pretty much every direction.

    So we just reframe this as a two-class problem. So we map those 15 categories to one of those two classes each, so either directional or all-around scenario. And then, what we do next is we just retrain again with those only two classes. And this time, I'm going to show you how this works in the app.
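
    Programmatically, the remapping itself is a one-liner; here is a small sketch, where the list of which scenes count as directional is purely illustrative and not the actual grouping used in the demo.

        % Sketch: collapse the 15 scene labels into the two hearing-aid modes.
        directionalScenes = ["cafe" "restaurant" "metro_station" "office"];  % assumed
        isDir = ismember(string(Ytrain), directionalScenes);
        Y2train = categorical(isDir, [true false], ["directional" "all-around"]);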

    So I just launch the app. And, actually, what I'm doing is I'm launching a previous session. So this is something I did previously. I can save the session and just open it up again in the Learner app. And I'm going to run you quickly through the app and show some capabilities and features that you can use to quickly compare and train different machine learning models.

    So we have some visualizations that indicate the two classes here. So you can see here, we have all-around and directional. And the x- and y-axes are just two predictors. So remember, the predictors are coefficients from the wavelet scattering, in this case. What you already see is that the classes, of course, don't cleanly separate. So you can't just put a hyperplane that separates the red from the blue dots. It's not that simple.

    All right, on the left, we can see a bunch of already trained models, like bagged trees, support vector machines, linear discriminant, and whatnot. And we see the corresponding accuracies on the validation data set. We could also look at this on a test data set. So currently, the leader seems to be a K-Nearest Neighbor model.

    I'd like to point out a few more capabilities here so we can train additional models from our model gallery here. So there's a lot of models we did not yet try out. We can also select to train just every model that is available. We can also train them in parallel, like leveraging this button if you have multiple cores, or even a cluster. We can tune some model hyperparameters, like with the optimizer.

    We can also, for example, hyperparameter-tune a specific model, like a decision tree, in this case. And we can look at some visualizations, like confusion matrices, for a specific model. This allows us to analyze, compare, and decide on a specific model easily in an interactive manner.

    Towards the end, once we are very happy with a specific model, we can always export any model. We can either generate a corresponding function, since everything that is available in the app is also, of course, available via the command line, so we can automate this and turn it into an executable function or script. Or we can just export the model for later usage from the app and either save it to disk or keep it in the MATLAB workspace.
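
    For reference, a model exported from the app comes back as a struct with a predictFcn field; a minimal usage sketch, with assumed variable names, looks like this.

        % Sketch: use a model exported from Classification Learner.
        yhat = trainedModel.predictFcn(testTbl);          % same table format as training
        accuracy = mean(yhat == testTbl.Scene)            % fraction correctly classified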

    So much for a quick run-through of the Classification Learner app. So let me minimize it again and continue. So, again, what we want to do for the two-class scenario is similar to the first scenario. We want to look at model sizes and accuracies. We just append them as the second row in our table. So remember, the first row here was the initial models with the 15 classes, and the second row here is the modified models that only have two classes.

    And we see that depending on the model architecture, we get quite substantial reductions in the memory consumption. Or it does not seem to have a lot of impact, for example, for the K-Nearest Neighbor models, going from 15 classes to 2 classes really did not matter too much.

    So again, it's important to note that the effect something like a reduction of the number of classes will have on the model really depends on the model architecture itself; similar with the accuracies. Overall, the accuracies go up, which is expected because we're dealing with a simpler problem. Remember, originally, we had 15 classes. Now, we're only left with 2 classes. This is an easier problem, if you want. And as expected, almost all model accuracies increase as a consequence.

    And I think, at this point, I want to talk about an additional step that you typically want to do. So from the wavelet scattering, we get quite a high number of features. The actual dimension is something like 200, or 198 features, that we considered. The question is: can we do with less? Can we bring this number down to some smaller number of features that still gives similar accuracies when training the corresponding models?

    And there are many different techniques that can be applied to reduce the number of features, such as Principal Component Analysis or the technique that we apply here, which is an algorithm called Maximum Relevance Minimum Redundancy, MRMR, which just gives us scores for the respective importance of each feature.

    And we can sort those, and we get a plot like that. And it has a pretty heavy tail, which is more or less typical for wavelet scattering. So we don't have a specific break point where we could say, OK, starting from feature 100, the remaining features are not important, which means we have to make some kind of decision, which is, of course, more or less arbitrary. You can use some metrics, but we decided to just cut off at 50 features and go on with 50 features, rather than the original 198.
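
    In code, this ranking step could look like the following sketch; fscmrmr is the Statistics and Machine Learning Toolbox implementation of MRMR, and the variable names are assumptions.

        % Sketch: rank the scattering features with MRMR and keep the top 50.
        [idx, scores] = fscmrmr(Xtrain, Y2train);   % importance score per feature
        bar(scores(idx))                            % heavy-tailed score profile
        keepIdx  = idx(1:50);                       % somewhat arbitrary cut-off at 50
        Xtrain50 = Xtrain(:, keepIdx);
        Xtest50  = Xtest(:, keepIdx);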

    We can do the same exercise. And I think you get the trick now. So we have, again, our tables that we can have a look at. Now, if we retrain those models with only 50 features, rather than the 198 originally, we really see, across the board, quite a substantial reduction in the memory footprint. Especially, of course, the logistic model, which is the smallest one anyway. And it's really small; it's only five kilobytes now. Whereas other models are still not very small. So the Gaussian SVM model is still more than 1 megabyte.

    Similar with the accuracies: as we hoped, we don't observe a dramatic drop in the accuracy. So we can, more or less, retain the original levels, with a bit of reduction in accuracy across the board, which is to be expected. After all, we have less information available. We have fewer features to train on.

    One specific model that we want to have a look at is the Gaussian SVM model. And we want to see whether we can bring down the memory consumption, while at the same time, probably, retaining or even increasing the accuracy values. And this is the point where I'd like to give it back to Christoph, who is going to dig into this in some more detail.

    Yeah. Thanks a lot, Christoph, exactly. So, now, you've shown us how to evaluate various models and check their sizes. And now, of course, all of these models have various hyperparameters which you can use to also influence the size. And, well, how are we going to do that? Are we going to just try and manually adjust that? Of course not, that would not be very efficient, I think.

    So I want to talk about hyperparameter optimization. There's this very nice technique called Bayesian hyperparameter optimization. The main part is the bayesopt function. And it can help us to automatically evaluate and optimize our size. So what we do is we define a bunch of optimizable variables, which we want to tune for our models. And we also want to have a constraint, so a constraint that the model cannot exceed a certain size.

    So here, we kind of got this number by eyeballing what would fit on our board. And then we set a cap here on the number of support vectors of our Support Vector Machine. So that's one example. So we choose a Support Vector Machine, and we tell it that it should have, at most, 100 support vectors. Of course, if you have another model, then you would have different hyperparameters. Support vectors are specific to Support Vector Machines. But trees or other models all have specific parameters which you could put in here. And then you can optimize the accuracy while still maintaining this constraint.
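
    Here is a minimal sketch of that idea: bayesopt tunes the SVM's box constraint and kernel scale while a coupled constraint keeps the number of support vectors at or below 100. The variable names, search ranges, and cross-validation scheme are assumptions, not the exact code from the live script.

        % Sketch: Bayesian optimization with a coupled constraint on model size.
        vars = [optimizableVariable('box',   [1e-3 1e3], 'Transform', 'log'), ...
                optimizableVariable('sigma', [1e-3 1e2], 'Transform', 'log')];
        fun = @(p) svmObjective(p, Xtrain50, Y2train);
        results = bayesopt(fun, vars, ...
            'NumCoupledConstraints', 1, ...        % constraint returned by fun
            'MaxObjectiveEvaluations', 60, ...
            'UseParallel', true);
        bestP = bestPoint(results);

        function [objective, constraint] = svmObjective(p, X, Y)
            % Train a Gaussian-kernel SVM with the proposed hyperparameters.
            mdl = fitcsvm(X, Y, 'KernelFunction', 'gaussian', ...
                'BoxConstraint', p.box, 'KernelScale', p.sigma);
            objective  = kfoldLoss(crossval(mdl, 'KFold', 5));  % CV misclassification
            constraint = size(mdl.SupportVectors, 1) - 100;     % feasible if <= 0
        end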

    And I'm just going to run this part. Now, I could just run this on my local machine. But Bayesian optimization actually lends itself pretty nicely to parallelization. So I have this UseParallel option set to true here. And I'm actually not going to run this on my local machine because my laptop has four cores, so parallelization doesn't really get you that far.

    Instead, what I did is I used a very nice, very simple offering by MathWorks. So we have this Cloud Center. In the Cloud Center, you can very easily create clusters on AWS, Amazon Web Services. So if you hit Create Cluster, then you give it a name. You select the MATLAB version. You select the machine type here. There's quite a wide range of options from Amazon and a bunch of other settings.

    And then once that's done, you create a cluster here. I already started mine. And then you can import this easily into MATLAB. So if I go back to MATLAB here, in the Parallel menu, you can discover clusters. And then all the clusters that you created under your account will show up. And I can just select my AWS_ParallelServer cluster here. And now I've started it up, and everything will run there. I have 16 workers now in the cloud, instead of only 4 on my laptop.
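
    Once the cluster profile is imported, opening a pool on it is one line; a sketch, using the profile name shown in the demo, could be:

        % Sketch: open a 16-worker pool on the imported Cloud Center cluster.
        pool = parpool('AWS_ParallelServer', 16);
        % ... run bayesopt with 'UseParallel', true ...
        delete(pool)                               % shut the pool down when done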

    So now we have a bunch of plots which showed up. On the left, we see the number of function evaluations plotted against the objective we're trying to minimize. We have another plot which shows us the probability of feasibility, so the probability that the next set of hyperparameters that will be evaluated actually satisfies our constraints. And here on the right side, you see the degree of constraint violation for different points. So the blue points are the actual evaluations which we ran. And red is the estimated constraint surface.

    And here on the bottom, we can see the list of function evaluations. So we can see we have 16 active workers running those iterations. Some of them run pretty fast, just a few seconds. And you can see a bunch of others actually take quite a long time, up to a few minutes. And this is going to run for a while now. And I think we're going to hit the fast-forward button, which is very handy, and come back once this is done.

    And we can see now it's finished. And it has found a bunch of feasible points, which is good. And here, again, we have a table which shows us the details of the various runs and whether the constraints were satisfied or not. We see we ended up with 64 iterations here. It took us 974 seconds. But the total evaluation time was 12,000 seconds. So we have a speed-up of around 12 to 13 with 16 workers. I think this goes to show that, yeah, using this parallel computing capability here really makes things go a lot faster.

    And from this, it now shows us the best observed combination of hyperparameters, so the best values for the box constraint and sigma parameters for this SVM model, which we can now take. And we can then retrain the model with these two parameters and use this optimized model for deployment.

    All right, so we've now done this optimization. We've shown it here for the SVM, but we've, of course, done it for the other models as well. I'm just going to load them in here. And let's just quickly look at a confusion chart for one of these. We see, for the SVM, for example, it's not doing so badly. Of course, there are still some issues; mostly, directional is sometimes misclassified as all-around sound. Now, whether this accuracy is good enough or not really depends on your application, of course. For the sake of this demo, we assume that this is an acceptable confusion matrix. So we can move on.
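
    For reference, a confusion chart like the one shown can be produced with a couple of lines; in this sketch, the optimized model and test-set variable names are assumptions.

        % Sketch: confusion chart for the optimized SVM on held-out test data.
        yhat = predict(mdlSVMopt, Xtest50);
        confusionchart(Y2test, yhat, 'RowSummary', 'row-normalized');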

    And let's just quickly summarize the results again so far. So we now have the initial sizes with all the 198 predictors, the two-class models, the two-class models with only 50 features, and the two-class models with 50 features, size-optimized with this hyperparameter optimization.

    And we can see, again, that for some of them, we managed to reduce the size quite a bit again, especially for the neural network. We went now from 21 to 9 kilobytes, and here for these boosted models from 180 to 46. And for others, like, for example, the KNN model, it didn't really change anything, because KNN doesn't really have hyperparameters we can choose to reduce the size like this. And, of course, the accuracies, you can see that they are still very much acceptable, even for those optimized models.

    For the SVM, the accuracy actually dropped probably the most of all of these, interestingly enough. But for the others, we see, even with the optimized models, we get very acceptable sizes. I think especially this neural network here with two layers is showing a lot of promise.

    And, yeah, just to show this again as a graphic, we have this bar chart. Here we can see that, in the end, the logistic model is the smallest and the KNN model is the largest. But, yeah, we can see that the sizes are reduced quite a bit. The sizes are actually on a log scale, which is important to note.

    All right, one last summary. So the model sizes, we've seen that they went down with each step we did. The target size, we are aiming for 50 kilobytes. So we see that the median model sizes are definitely below that threshold. So we should be able to get something. We also see the accuracies. They went up when we reduced the model from 15 to 2 classes, as expected. And then they dropped down a bit with each of these reduction steps. But still, for this application now, we will assume that this loss in accuracy is very much acceptable.

    So now we have models, which are small enough. And I think now, the next step is to actually bring them into Simulink, right, Christoph?

    Yeah. Thanks, Christoph. That's true. So let's look at how this really would play out in a Simulink model. For that purpose, I'm going to open a corresponding model that we've prepared. And you can see it here. It really comprises different parts. So going from left to right, we have the audio input. Now, of course, in the end, this would be a microphone or, yeah, some sensor generating input. Then we convert the audio signal to a single channel, basically just adding the two channels.

    We apply our wavelet scattering. So, of course, we have to replicate the preprocessing that we did in MATLAB, also on the Simulink end. We take the mean values of the features. And then we feed this into another subsystem that actually contains the classifier.

    Before the signal goes into the subsystem, we have a component here that is responsible for the feature selection. So we just index into the 198 features, as you can see here. And then only 50 features actually enter the classifier. So this kind of replicates, also on the Simulink end, the steps that we took when reducing our model sizes.

    Now, this is a variant subsystem, which allows us, basically, to select from different model variants. We could have any number here; we just decided to have those three choices: a Support Vector Machine, Boosted Trees, or a Neural Network. Again, this allows us to simply switch between different model types, to simulate them, and to generate code from them.

    Now, let me quickly close this again. What we also want to think about is, generally, what is the purpose of bringing this into Simulink? And the main idea is, of course, that in a model-based design workflow, we have many methodological advantages. We can, for example, leverage add-on functionalities, such as test cases with Simulink Test. We can simulate together with other components. We can generate C or C++ code from our model.

    So there's many aspects that our AI model can benefit from, in an actual application scenario. So what we can, for example, look at is requirements. So typically, when you design such a hearing aid, you have to fulfill certain requirements. And this could be anything.

    So, typically, those requirements could be, for example, captured in a simple Word document, or in other tools. What we want to look at here is that Simulink also offers different views, and we can enter the so-called requirements perspective.

    So this brings us to a view where we can look at the different requirements we have defined. In this admittedly simple example, we only have two of those. Namely, one is the requirement that we want to have the ability to select from different model architectures, for example, a Support Vector Machine or a Neural Network.

    So we want our model to be able to simulate those different architectures and to yield the correct results when simulating, so either directional or all-around for the sound category. Similarly, for the model accuracy, we can have some threshold. You could say we want to achieve a certain percentage of accuracy when simulating the model. We can already see that there are two columns here. We can see whether it's implemented as a specific test; I'm going to show that to you in a minute. And we can also see whether the test ran successfully.

    So this is done here in the Test Manager. We can see that we have defined corresponding tests, one for the actual selection of the model type, as well as the actual model accuracy. So now we can run our two tests that we defined for the model selection as well as the model accuracy.

    So the Test Manager will execute the tests and display a summary of the test results in the Test Manager app. We verify that we can switch and simulate the different model types. And we can also verify that our accuracy meets the expected requirements. So we want to correctly classify the sound source as directional or all-around sound sources.

    So the first test already completed, and the second as well. You can see the green check marks on my screen. And we can see some diagnostic output. For example, we can look at the scores of the classifier as well as the predicted labels, which in this case is directional because it's a cafe scenario, right? And similarly, we see these results for the accuracy. We see that the string is as expected: we have the directional sound source because we're dealing with the cafe scenario.

    OK, this allows you, again, to connect your requirements to test implementations in the Test Manager and verify that all the requirements are covered by a test and are then actually verified. So this is the view we saw previously in the Requirements Manager. Sorry, let me open that again. We can see that the tests are both implemented, and everything is green. So we verified that the tests ran through successfully. Right. So this simple example shows how AI models can play a role in a classic model-based design workflow, where you also want to have requirements, verify them, and have connected test cases.
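
    The same tests can also be run programmatically with the Simulink Test API; a hedged sketch, with a hypothetical test-file name, might look like this.

        % Sketch: load and run the Test Manager tests from code.
        tf = sltest.testmanager.load('hearing_aid_tests.mldatx');   % hypothetical file
        results = sltest.testmanager.run;                           % run all loaded tests
        sltest.testmanager.report(results, 'test_report.pdf');      % export a summary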

    So what else do we have? I think there is one more thing we should talk about. And this is now really going to actual embedded hardware. As an example, I have brought specific hardware with me. It's this Texas Instruments C2000 microcontroller. You can see it right here; I'll just hold it into the camera a bit. So it's a kind of standard evaluation board.

    And I installed the corresponding add-ons so that I can easily connect my hardware to my laptop. And then there is a SIL/PIL Manager that is available from Simulink IDE, which allows me to compare the model running in normal simulation mode versus the model running in processor-in-the-loop mode.

    Which means I'll generate code for the classifier model, for the Support Vector Machine, for example, or for the boosted trees, then copy the generated code over to the board and execute it there, and compare the results from the code executed on the board to the results of the normal simulation from within Simulink.

    To this end, I need to change the corresponding Block Parameter and say, OK, I no longer want to simulate in normal mode, but I want to simulate in Processor-in-the-loop mode. I could also select Software-in-the-loop, by the way. So I'm going to apply that.
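
    For completeness, the same switch can be made from code when the classifier is a referenced model; in this sketch, the model and block names are hypothetical.

        % Sketch: switch a referenced model block to PIL simulation mode.
        set_param('scene_classifier/Classifier', ...
            'SimulationMode', 'Processor-in-the-loop (PIL)');  % hypothetical block path
        save_system('scene_classifier')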

    And then I'm going to save the model. And in the SIL/PIL tab, I just need to press this Run Verification button. And under the hood, Simulink will do two things. First of all, it will just run the simulation again in the normal mode, and use this as a baseline, and then generate code for the corresponding subsystem, plus generate code for the interfacing, so for the communication with the hardware.

    And then it will run the simulation again. But the contents of the classifier subsystem will no longer run on my laptop. Instead, they will run on my TI board here. And I can gain insight into the equivalence: is it still doing the exact same thing? And I can additionally get some metrics, so some results in terms of memory consumption and CPU load on my embedded device.

    So this is all available with the SIL/PIL Manager app that I am just running now. It takes some time to generate the code, to compile it, download it to the hardware, and run the corresponding simulation twice, as I just described. So once it's completed, it will output the corresponding signals in the Simulation Manager. And we get a nice report that tells us, OK, this ran through successfully. Here is the information about the board. And we can see the generated source code files as well.

    This is after successful code generation. Now it's really in the mode of running the PIL simulation. So it's just launching the PIL simulation now on the embedded device. OK, now it's actually running the simulation, and it runs pretty quickly on the embedded device as well. And in the Simulation Data Inspector, we see that the difference is 0, which is a nice thing. So we don't see any differences in the prediction running either in normal simulation or on the embedded hardware, which basically means that our model is still doing the exact same thing, which is what we want to have.

    Also, we get some reports, namely Execution Profiling Report that contains additional information about interesting metrics, such as how long do specific steps in our model take in nanoseconds, like here. And we can also get information about CPU utilization, which we can see is really low for executing this specific machine learning model. We take less than 0.1% of the available CPU power.

    Right. So this allows us to get some additional information about how heavy the model really is on the hardware. Now, there is one more thing I'd like to highlight. And for this purpose, I'm going to go back to MATLAB. There is a last thing: namely, I want to talk a bit about a feature called Design Cost Estimation.

    To that end, I'll just open up a report that this section is generating. So this is an autogenerated PDF report, which is only available with R2022b. That's why I just open it from a previous run. So I'm just going to open this report. It's a PDF file.

    And I can go through it a bit. So it is basically listing information about the system of interest, namely our subsystem. And it will contain some statistics about operators. So how many operations does our code have in terms of addition, multiplications, things like that? And we get some estimation of cost.

    For example, in this case, we selected the boosted trees, so we get metrics about the total cost of the system, as well as, for example, for the Classification Ensemble for the prediction, so the corresponding subsystems. And there is a screenshot.

    And towards the end, there is a table that contains the information about the memory consumption. So this is called Data Segment Variables in the table, where we see the sizes in bytes, really. This goes down to the byte level: how many bytes specific subsystems in our model take. And we can look at those and get an idea of how heavy specific implementations are going to be on any hardware.

    Now, this really goes down to the byte level. And it can be applied not only to the machine learning model, but also to the feature extraction steps, so to your complete design. You can just run this report and get a more accurate estimate of the overall memory consumption your implementation is going to have on the hardware.

    With that being said, I think we can go back to the slides. So a few more words on the design cost estimation report I just showed. What is really the core idea behind that? Questions that you typically might ask during the design process of any embedded algorithm are: what are the root causes for a specific bottleneck? Will my complete system ultimately be able to execute on my chip? Is it small enough? Or, if I make some changes, for example, to the underlying Simulink model, how do those changes affect things like performance or memory footprint?

    And, now, you can summarize those questions from a specific perspective and say, OK, what is the overall cost of my design? The question is, what really is cost? You could have different definitions of a term like cost. As I said, it could mean performance or memory footprint.

    You might ask the question, why would I need such an estimate? I can just count the bytes of any binary. So usually compilers have byte counters associated with them, like objdump for GCC or the dumpbin executable on Windows systems.

    But the estimates are useful after all, because it might be that your design is a work in progress. So your Simulink model is maybe something you're still working on, and in its current state it's probably not ready for code generation or not ready to be compiled. Maybe you don't have access to the specific compiler you need because you're still developing on your host machine; you don't have the license, or you don't know exactly how to generate the binaries. But still, you don't want to wait for all those things and want to get an intermediate estimate of your expected cost on the hardware. So this is where design cost estimation can be useful.

    The report showed two things. So we have a table of the data segment size, which is just an estimation of the amount of memory in bytes for a specific segment of the code, sorted by the model hierarchy, if you want. It can be applied to a complete Simulink model, including model reference hierarchy.

    The other metric that we looked at is the Operator Count. It is an estimation of the size of the design based on a specific weighted sum of the operators used in the generated code, so things like additions and multiplications, and so on. It does not have a strict unit. But it's useful in that it allows you to compare different designs implemented in Simulink. It also extends to a complete Simulink model, including the references.

    You can do this programmatically. So as I showed you, there's a code in the live script that generates this report. And it will come as a PDF report to be shared with others, to be consumed, to be documented.

    Yes, thanks a lot, Christoph. So, yeah, this was really cool to see. And I think especially this processor-in-the-loop functionality really adds so much, because we can get the real metrics of our model on the target directly. And PIL makes it very easy to run it, to validate that it's still doing the same thing that it did on our computer. And we can directly get the CPU estimation and the exact size on our board. So this is a very, very helpful thing for rapid prototyping and, yeah, it just makes this whole development process a lot smoother.

    Of course, now you might be wondering: we've deployed this to a C2000 TI board, but how easy would it be for other targets? And I think with MATLAB, we can really say that we have a very broad range of deployment targets available.

    Now, if we want to split it up like this, you can deploy to GPU and CPU targets, as well as FPGA targets, microcontrollers, and PLCs. And we support a very broad range of brands. So all the big brands are there; you see TI showing up here. And there are a lot of support packages available for these. And even if you are aiming to target a board that is not listed here, that is not supported out of the box, there are ways to make it happen. So get in touch with us and we can help you to also target your custom hardware, if that's necessary.

    OK, so I think now we're nearing the end. And I'd just like to close it out and summarize again all that we've done now. We've done actually quite a lot of things. So again, I'd like to give an overview of the model compression workflow, how we envision it. So first, determining hardware constraints: we set up this goal of having a model that's smaller than 50 kilobytes. Then we looked at how we can select and train various models. And then we showed how to simplify the model by using feature selection techniques and by tuning the hyperparameters with this Bayesian hyperparameter optimization. We didn't show the quantization step live here, but this would be another thing you could be doing. And then we also talked about deployment and integration. We showed processor-in-the-loop simulation.

    I promised in the beginning that we have a small table, which I think is a very helpful overview. It lists various machine learning models, gives a qualitative estimate of their size, and also shows which hyperparameters actually affect the model size. This is a really helpful small cheat sheet you can take a look at, and it gives you good guidelines to get started.

    Now, of course, another thing I also talked about in the beginning is: what about deep neural networks? And I promised a few slides; I want to reiterate these here. So there is also specific functionality for compressing deep neural networks, which are much larger than those machine learning models we've seen. They can also be more powerful for specific tasks. And they need specific techniques to reduce their size. And reducing the size here is even more important because, yeah, if they should run in real time on relatively small targets, it can be very challenging.

    In general, the workflow is pretty similar. So you select a deep neural network. You simplify it. You quantize it. And then you deploy. To select the model, we have the Deep Network Designer, which is a nice graphical interface. Or sometimes you also just get the model from the literature, or from someone else. And then you can start with Pruning, that's one technique, to reduce the size, and with quantization, to go from double to another precision.

    Deep learning network pruning: the idea here is that since a neural network, again, is heavily over-parameterized, in general, we should be able to reduce the number of parameters. We should be able to remove certain parts of the network without reducing the accuracy.

    So there are two main methods: unstructured pruning and structured pruning. Either we remove a bunch of connections, or we also remove entire neurons from the network. And we can see, here in this small video, that we start with the original accuracy. And then we make the network more sparse. And the more parameters you remove, at some point we see the accuracy dropping off. So this gives us an estimate of how many parameters you can kick out.

    There's another functionality coming for LSTM, Long Short-Term Memory, layers. And of course, we're generally continuously expanding these pruning approaches. There's a link here as well to this doc page, which gives a very nice overview of the techniques. The other thing is quantization. So there is a specific Deep Network Quantizer interface now, where you can change from double precision to int8 precision for specific layers. And of course, this affects the memory size dramatically.

    So this app helps you to choose scaling schemes to do the quantization optimally, to reduce the footprint while maintaining accuracy. Here we have one example where we went from 6 megabytes to 860 kilobytes. I think that's pretty massive. And, yeah, in general, I think pruning and quantization together can really significantly compress a network without affecting the accuracy. So this is another example of a network where we got a 95% reduction in size, with almost no loss in accuracy. Yeah, do check this out if you're dealing with networks; I think these tools are very, very handy.
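
    As a rough sketch of the quantization workflow mentioned here, the dlquantizer API from Deep Learning Toolbox can also be driven from code; the network and calibration-datastore variable names are assumptions.

        % Sketch: int8 quantization of a trained deep network.
        quantObj = dlquantizer(trainedNet, 'ExecutionEnvironment', 'MATLAB');
        calibrate(quantObj, calibrationData);      % collect dynamic ranges of weights/activations
        qNet = quantize(quantObj);                 % int8 network object
        qDetails = quantizationDetails(qNet)       % inspect which layers were quantized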

    So, conclusions: our couch now fits into our room. We did a good job, and the model fits. I think the key message is that you can actually fit AI for many applications onto limited hardware. Especially when talking about machine learning models or shallow neural networks, it's definitely possible nowadays.

    And with the MathWorks tools, well, we hope we could show that they make fitting those AI models onto constrained hardware a lot easier. And it's the same high-level workflow for any type of AI model. So it doesn't differ that much between the different models. It's a very nice one-size-fits-all workflow.

    And, yeah, I guess the question you can ask yourself is which constraints are most challenging for your application now. And if you want to learn a bit more, then we have a bunch of links here. We will share these with you after the presentation, where you can learn a bit more about embedded deployment. There are some good videos on quantization. And we also include two nice user stories on blood type classification and autonomous tractors, where AI on embedded devices is being used. And I think now we're going to move to the Q&A. Thank you very much for listening.