    Deep Learning Network Quantization for Deployment to Embedded Targets

    Overview

    Quantization enables deploying deep learning networks, such as semantic segmentation algorithms, to resource-limited targets. Deployment to Arm, FPGA, and GPU targets will be shown. The challenges of maintaining network accuracy while reducing both the size of the network and the memory it needs will be explored.

    Highlights

    • Deploying deep learning networks on resource-constrained targets
    • A semantic segmentation example of compressing a trained network while preserving accuracy
    • Generating code to deploy deep networks to Arm devices

    About the Presenters

    Greg Coppenrath

    Greg is the product marketing manager for Fixed-Point Designer and Deep Learning Toolbox Model Quantization Library. He has experience in the development of embedded systems and product development in the semiconductor industry. He received an MBA from Worcester Polytechnic Institute, an M.S. in Electrical Engineering from the University of Massachusetts Lowell, and received a B.S. in Electrical Engineering from WPI.

    Brenda Zhuang

    Brenda Zhuang is a software engineering manager and leads a team that develops software tools for automatic deployment of embedded applications in microprocessors and FPGAs. Brenda has contributed to the development and evolution of many new features in the MATLAB and Simulink product families. She received her Ph.D. in Systems Engineering from Boston University and M.S. in Electrical and Electronics Engineering from Hong Kong University of Science and Technology. 

    Recorded: 27 Apr 2021

    Welcome, everyone, to today's webinar. We are eager to talk about deep learning network quantization for deployment to embedded targets. My name is Greg Coppenrath. I'm the senior product marketing manager for Fixed-Point Designer and the deep learning quantization support package. I have an undergraduate degree in electrical engineering from Worcester Polytech, and I later went back to WPI for a technical marketing MBA as well.

    I have been working with embedded systems in various capacities for 20 years. Brenda Zhuang is with us today as well. She is a software engineering manager and leads a team that develops software tools for automatic deployment of embedded applications. Brenda has contributed to the development of many new features in the MATLAB and Simulink product families. She has a PhD in systems engineering from Boston University, and she has a master's in electrical and electronics engineering from the Hong Kong University of Science and Technology. Brenda will be monitoring the chat and will help me answer some questions at the end. If you have any comments or questions, please write them in the chat. We would love to hear from you. So, let's get started.

    You might be wondering, why is there a picture of a beach? I'd like to start every day at the beach. I'm showing this as we'll use the Hamlin Beach State Park as the basis for the example we walk through later. I live in New England, so I know that time at the beach is precious. It's a very short season.

    There are three key takeaways today. The first is that quantization reduces the size of your deep network. The second is that quantization enables efficient deployment to embedded devices. And the third is the workflow to deploy to GPUs and CPUs, which we will walk through. First, we're going to define quantization and explain why we're doing quantization at all. Then, in the second half, we'll go through an example that shows you how to generate the code.

    Let's start with the basics. What is quantization? Deep networks have a matrix of values for each layer, commonly stored as 32-bit, single-precision floating point, which can be quantized to lower-precision data types. For now, we're going to use a scaled 8-bit integer data type for the weights, biases, and activations.

    Here's an example of activations with a range of 0 to 9.457. I just made up this 3-by-3 table, but you can imagine deep networks have much bigger matrices. This is how the information would be stored with 32-bit floating-point data types. If we move this to an 8-bit fixed-point data type, then we need four fraction bits to represent this. You can see there's some rounding that brings the numbers slightly above the original 32-bit floating-point numbers, and some that brings them slightly below. This will be stored in the system as a stored integer with a scale factor of 2 to the minus 4.
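    In code, that conversion looks roughly like the sketch below. It assumes an unsigned 8-bit stored integer with four fraction bits (scale factor 2^-4), as described above; the 3-by-3 values other than 9.457 are made up for illustration.

    % Scaled 8-bit quantization of a small matrix of activations (illustrative values).
    A = single([0.31 5.17 9.457; 2.01 7.50 0.85; 4.40 1.20 6.90]);
    scale = 2^-4;                           % four fraction bits
    storedInt = uint8(round(A / scale));    % stored integers, rounded to nearest
    Aquant = single(storedInt) * scale;     % real-world values the 8-bit type represents
    maxError = max(abs(Aquant(:) - A(:)))   % rounding error is at most scale/2 = 0.03125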

    Reducing the data type to represent the same numbers with less range and/or precision is quantization. This results in numbers being rounded to the nearest representable value, a trade-off that significantly reduces the data size. We're trying to keep the accuracy of the floating-point network with our fixed-point implementation.

    So, why 8-bit quantization? This is a post-training quantization approach. First, 8-bit data types align well with hardware accelerators that optimize math operations on 8-bit data. Embedded devices also have architectures that operate efficiently on 8-bit values. Next, feature extraction with 8-bit quantized values still delivers high accuracy. Deep networks have redundancy and more computation than is needed, so the accuracy of the network can mostly be preserved with a data type that is a quarter of the size.

    The last reason I will touch on is a paper by Mark Horowitz at ISSCC. He compares this based on a 45-nanometer process node, but you can imagine this would scale; it might be slightly different, but the general principle still holds. I've shown 8-bit fixed-point adds and multiplies, as well as 32-bit floating-point adds and multiplies, and when you compare these, the 8-bit fixed-point data type uses about 95% less power.

    Why quantize networks at all? There are three different aspects we look at. There's the memory size of the network: how much of it is going to be stored on-chip versus off-chip, and how big it is in general. Then, there's how fast the network runs: how many frames per second it can process, or how quickly it can return a result. And finally, there's the accuracy of the network: how accurate are the predictions that come out of it?

    Ideally, we want to decrease the memory footprint, increase the speed at which it runs, and keep the accuracy close to the 32-bit floating-point result. But we can always iterate. So let's say the floating-point version was 90% accurate and the 8-bit quantized version is 80%. We could go back, iterate, and get that closer to the 90%.

    So, why are we doing this? The first reason is faster inference: the answers come back more quickly and take fewer computational resources. The next is that we reduce the size so that we can fit more control logic, glue logic, and communications. The next is that there are limited resources on an embedded target, so if we want to go to a lower-cost device, we're going to have less memory available to us.

    Also, if we want to have more than one deep network, it helps to reduce the size of the networks, to be able to deploy in the same device. Finally, especially when the system is battery powered, a device on the edge can be always on, and we want to decrease the power consumption so that the device will last longer in the field.

    So, let's look at some impacts of quantization. We took ResNet-50 and ran int8 quantization on it. You can see that 97% of the layers were able to be quantized down to the int8 data type. This reduced the memory footprint by 70%, and the top-5 accuracy was reduced by a little more than 2%. Then, we looked at VGG-19. In this network, 72% of the layers were able to be quantized down to the int8 data type. The reduction was only 10%, but the accuracy was pretty much on par with the floating-point version.

    So, what actually gets quantized within the MathWorks tools? We start with what is quantized for all three supported targets: CPU, GPU, and FPGA. First, we look at the layers. The convolution layers have lots of multiply-accumulate operations on large matrices, and doing the math in int8 is much faster. Convolution layers have the biggest impact because of all the memory accesses required to do convolution, so they also have the biggest potential speed improvement.

    Quantization is done per layer, so remember that, generally, the network is trained in 32-bit single precision. One area of optimization: instead of quantizing to int8 and then needing to cast back to single for input to the next layer, we can fuse the layers and keep them in int8. For example, consider a batch norm layer that follows a convolution layer. We know batch norm can operate natively in int8, so there is no need to cast back up to single.

    These fusion optimizations need to be made with the target in mind. Not all fusions work for all targets or achieve the target efficiencies. Looking at the parameters, not only weights and biases but also activations can be quantized.

    Now, looking at the network types that are supported, the first one is a series network, which is pretty straightforward: there's one input and one output for every layer. The next supported network type is a DAG network. I've shown here a representative example, SqueezeNet. You can see that there are some areas where there are two paths; there's branching. The left branch and the right branch then re-merge in the gray blocks here. So care needs to be taken that quantization doesn't inadvertently impact the accuracy by quantizing one branch without knowledge of the other.

    Finally, we have object detectors, and there's another layer of complexity here as well. I've shown an example YOLO v2 network, and you can see it's unbalanced. There's a point at which it branches, and there are two layers on one side and eight layers on the other. These then need to come back together into the gray block. So again, we need to be careful that the quantization doesn't create too much accuracy loss because of the imbalance of the network.

    Finally, you can think of LSTM networks, which have a notion of memory that can be very difficult to quantize. It's an interesting challenge. We don't have a solution for that now, but it is an interesting challenge.

    There are additional items that are quantized specifically for each target. For instance, SSD object detectors can be quantized for CPU and GPU targets, pooling layers for GPU and FPGA targets, and the fully connected layers for FPGAs as well.

    Next, let's look at the deployment example that we have. The example we're going to go over uses semantic segmentation. Semantic segmentation labels each pixel in an image with a class. It can be used in many different applications, such as autonomous driving, or aerial photography for tracking the environmental health of a region.

    For self-driving vehicles, buildings, sidewalks, and the road need to be discerned. In the middle, we have an object in an image; you can see they're trying to label the actual dog versus the background pixels. Then on the right, we have deep-learning-based semantic segmentation, which can yield a precise measurement of vegetation cover from high-resolution aerial images. This image is an overhead view of the same beach we saw in the Google image I showed earlier.

    One challenge is differentiating classes with similar visual characteristics: if you have a tree and grass, it's hard to tell the difference because they're both green. When you're looking at just pure image classification, you have the example of the vegetation in Paris. You can pretty much see the river in the black there, but it's difficult to tell where the trees start and how dense the trees are.

    Then, as you move to the middle, this is drone footage of a farm. They're able to analyze the crops and the weeds, the growth stages they're at, and optimize when to fertilize and when to harvest. And finally, on the right, with deep learning we can get a much higher level of accuracy by looking at this scene to determine how much deforestation is happening.

    We'll use the Hamlin Beach State Park data set from the Rochester Institute of Technology. It includes three near-infrared channels to supplement the RGB color images. The infrared information can help us separate areas where there may be a green tree in a green field. With a standard camera photo of the overhead scene, it would be difficult to segment the image; the infrared adds depth of information.

    The data set also contains labeled training, validation, and test sets, with 18 object class labels. It can be used to monitor areas for beach erosion on the coast, changes in forestation levels, or algae blooms in bodies of water. These can be monitored over time with different object classes and different training sets.

    So, a little bit on the network we're going to use. It's a variation of the U-Net network. In U-Net, the initial series of convolutional layers is interspersed with max pooling layers. Starting with the image at the top, we see a decrease in the resolution of the input as we go to the right. These layers are followed by a series of convolutional layers interspersed with up-sampling operators. As you traverse back through the network, the resolution of the image increases again, thus the U shape.
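    As a rough sketch of how such an encoder-decoder network can be constructed, the unetLayers function from Computer Vision Toolbox builds a U-Net layer graph; the patch size, channel count, and encoder depth below are illustrative, not necessarily the exact network used in this example.

    % Build a U-Net layer graph for multispectral patches (illustrative sizes).
    imageSize = [256 256 6];   % e.g. 256-by-256 patches with 3 RGB + 3 near-infrared channels
    numClasses = 18;           % object class labels in the Hamlin Beach data set
    lgraph = unetLayers(imageSize, numClasses, 'EncoderDepth', 4);
    analyzeNetwork(lgraph)     % inspect the U-shaped encoder/decoder structure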

    Trained deep networks use memory to store the input data, the parameters (weights and biases), and the activations from each layer as the input propagates through the network. The majority of the pre-trained neural networks from Deep Learning Toolbox use single-precision floating-point data types. Even small networks require a considerable amount of memory and hardware to perform this floating-point math. This inhibits deployment of deep learning models to devices that have low computational power and smaller memory resources, such as ARM processors.

    To reduce the memory requirements of the network, quantization selects lower-precision data types, such as 8-bit scaled integers, to store the weights and activations. The U-Net network was originally developed to perform prediction for biomedical image segmentation applications, like identifying a tumor in medical imaging. For this example, the network was trained on the Hamlin Beach State Park data set. You can take the network, modify it, and retrain it for other applications as well. U-Net has the high redundancy we talked about earlier, so we expect a big improvement from quantization.

    So, here's a high-level look at the workflow we'll go through in our example. We have supported this for multiple releases now. You take a sample of images for calibration, run inference and collect the data ranges, then quantize and generate code for GPU, FPGA, or CPU. To access these tools, use Deep Learning Toolbox and the Deep Learning Toolbox Model Quantization Library support package.

    Let's walk through the GPU workflow. First, we create calibration and validation datastores. They'll be used to calibrate the network for quantization and to validate the network after quantization. Create a dlquantizer object and specify the network to quantize and the execution environment. Use the calibration datastore to calibrate the network. The calibration step runs inference with the pre-trained network and collects the dynamic range information for its layers. To inspect the network and calibration statistics stored in the dlquantizer object, open the Deep Network Quantizer app and import the dlquantizer object.
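    In code, that calibration step looks roughly like the following sketch. The network variable net, the datastore variables, and the folder names are illustrative; dlquantizer, calibrate, and the Deep Network Quantizer app are the tools described above.

    calData = imageDatastore('calibrationImages');   % sample images for calibration
    valData = imageDatastore('validationImages');    % images for validation after quantization
    quantObj = dlquantizer(net, 'ExecutionEnvironment', 'GPU');
    calResults = calibrate(quantObj, calData);       % run inference and collect dynamic ranges
    deepNetworkQuantizer                             % open the app and import quantObj to inspect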

    This is the entry point to the user interface, with the selection of the execution environment shown here. Remember, we talked about how the quantization can be optimized based on the target. For today's example, we're just going to show GPU and CPU execution environments. In the Deep Network Quantizer app, you can step through the quantization workflow.

    First, select the network from the base workspace, and the app displays the network structure. Next, choose the calibration datastore and calibrate all layers, and the app visualizes the range of each layer. You can then quantize and validate using the validation datastore. In this case, we quantize the compute-heavy convolution and fully connected layers, and we can see a significant reduction in memory. In the table in the middle, there are checkboxes for each layer. The ones selected are quantized. If you think a layer is contributing to accuracy loss, that layer can be deselected so it is not quantized.
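    The app's quantize-and-validate step can also be done programmatically; here is a sketch, assuming the dlquantizer object and validation datastore from above. The returned struct includes the validation metric and memory statistics before and after quantization.

    valResults = validate(quantObj, valData);   % quantize and validate against the datastore
    valResults.MetricResults.Result             % metric for the original vs. quantized network
    valResults.Statistics                       % memory statistics before and after quantization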

    We have quantized the network; now we can start to generate code. In this section, we use a simple entry-point function for code generation. We run the segmentation inference in the segnet_predict function and, using the quantization object, we can generate code and a report.
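    A sketch of that code generation step is shown below, assuming the calibrated dlquantizer object has been saved to quantObj.mat; the input size passed to codegen is illustrative and depends on the network's input layer.

    save('quantObj.mat', 'quantObj');
    cfg = coder.gpuConfig('mex');
    cfg.TargetLang = 'C++';
    cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
    cfg.DeepLearningConfig.DataType = 'int8';                      % generate quantized int8 inference
    cfg.DeepLearningConfig.CalibrationResultFile = 'quantObj.mat'; % ranges collected during calibration
    codegen -config cfg segnet_predict -args {ones(256,256,6,'single')} -report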

    The generated code is hard to read here, but you can see the entry points into the CUDA libraries for the int8 implementations of the layers that support quantization. All the files for the generated code are listed on the left. In this case, we quantized the compute-heavy convolution and fully connected layers, and we can see a significant memory reduction of 65%.

    A similar workflow is supported for CPU and ARM targets. Create a dlquantizer object and specify the network to quantize and the execution environment, in this case a CPU target. Use the calibration datastore to calibrate the network. Given the target characteristics, the statistics collected from the pre-trained network will be used to generate a quantized int8 network implementation. Again, to inspect the network and calibration statistics stored in the dlquantizer object, open the Deep Network Quantizer app and import the dlquantizer object.
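    A sketch of the CPU-target calibration step, reusing the calibration datastore from the GPU example (the variable and file names are illustrative):

    quantObjCPU = dlquantizer(net, 'ExecutionEnvironment', 'CPU');
    calResultsCPU = calibrate(quantObjCPU, calData);   % collect ranges with the pre-trained network
    save('quantObjCPU.mat', 'quantObjCPU');            % saved for use during code generation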

    The app visualizes the range of each layer. We quantize the compute-heavy convolution layers, and we can see a significant reduction in memory. For CPU targets, we don't yet support the validation workflow step. You can use PIL (processor-in-the-loop) mode to validate the quantized networks on embedded processors, such as a Raspberry Pi.

    The setup for code generation is slightly different, using a target language of C++ with the ARM Compute Library. We still use the same segnet_predict entry point as in the GPU example, and this will generate code and a report. The code generation report from MATLAB Coder is shown here; this is code that can run on an ARM processor. You can trace the generated C++ code from the code generation report viewer.
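    A sketch of that setup is shown below. It assumes Embedded Coder and the Raspberry Pi support package for the PIL run, the saved CPU dlquantizer object from above, and an ARM Compute Library version matching the one installed on the board; all of these settings are illustrative.

    cfg = coder.config('lib', 'ecoder', true);
    cfg.TargetLang = 'C++';
    cfg.VerificationMode = 'PIL';                    % run the generated code on the target board
    cfg.Hardware = coder.hardware('Raspberry Pi');
    cfg.DeepLearningConfig = coder.DeepLearningConfig('arm-compute');
    cfg.DeepLearningConfig.ArmComputeVersion = '20.02.1';   % must match the library on the board
    cfg.DeepLearningConfig.DataType = 'int8';
    cfg.DeepLearningConfig.CalibrationResultFile = 'quantObjCPU.mat';
    codegen -config cfg segnet_predict -args {ones(256,256,6,'single')} -report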

    There was a 66% reduction in learnable parameter memory. The accuracy was reduced from 90% to 88%. We use a custom metric function, which reports more detailed evaluations, such as the mean accuracy, and you can look into that report in more detail.
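    A custom metric like that can be supplied to the validation step shown earlier through dlquantizationOptions; here is a sketch, where hComputeSegmentationAccuracy is a hypothetical helper that computes the accuracy metric for this network.

    quantOpts = dlquantizationOptions( ...
        'MetricFcn', {@(x) hComputeSegmentationAccuracy(x, net, valData)});
    valResults = validate(quantObj, valData, quantOpts);   % reports the custom metric before and after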

    That completes our semantic segmentation example with the U-Net network. Let's review the key takeaways. First, we showed that quantization reduces the size of your deep network. We saw significant memory reduction with the VGG-19, ResNet-50, and U-Net examples.

    I talked about what quantization is and what gets quantized. We discussed the efficiency benefits for embedded deployment. And finally, we went through the workflow to deploy to GPUs and CPUs. For more information, please find the Deep Learning Toolbox Model Quantization Library on the File Exchange at MathWorks.com. Thank you for attending this webinar.