Description

Using Lookup Tables to Accelerate Deep Learning Inference

Name: Using Lookup Tables to Accelerate Deep Learning Inference
Uploaded: 2019-11-18T17:12:00-05:00
Duration: 2 min 10 s
Description: This video demonstrates how to replace a sigmoid function with a lookup table implementation and compares the relative execution speedup on an Arduino Due and an STMicroelectronics discovery board.

This video shows how the lookup table optimization capability generates an efficient lookup table for the sigmoid function, a key activation function in deep learning networks. We then compare the relative speedup on an Arduino Due® and an STMicroelectronics® discovery board, using the generated code in hardware-in-the-loop simulation.

Published: 19 Nov 2019

Full Transcript

A lookup table (LUT) is a key construct in embedded designs and is often used to speed up the run-time execution of certain functions in your algorithm. For instance, complex trig functions are often replaced with a more efficient LUT implementation.

Let’s try a simple experiment: applying the same principle to the sigmoid function to see how we can accelerate deep learning inference, particularly on the edge.

The sigmoid function is a key building block for neural networks and one of the commonly used nonlinear activation functions in deep learning networks.

Here we have a simple Simulink subsystem that models the sigmoid function. I am going to use the Lookup Table Optimizer app to generate an optimal LUT, specifying the input and output data types. Because this is a bounded function, I can specify the bounds on the output and, finally, a tolerance of 1% on the output.

Once the optimization problem is solved, we can look at the comparison plot to verify that the error of the LUT approximation is within our specified tolerance.
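As a rough illustration of the idea, and not the app's actual algorithm or its generated code, here is a minimal Python sketch that builds an evenly spaced sigmoid table, interpolates linearly, and checks the worst-case error against a tolerance. The function names, the table range of [-8, 8], the sweep range of [-10, 10], and the reading of "1%" as an absolute tolerance of 0.01 on the output are all assumptions made for this example. The brute-force search over table sizes is a simple stand-in for the optimization step.

```python
import math


def sigmoid(x):
    """Reference sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def build_lut(lo, hi, n):
    """Sample the sigmoid at n evenly spaced breakpoints on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    table = [sigmoid(lo + i * step) for i in range(n)]
    return table, step


def lut_sigmoid(x, table, lo, step):
    """Clip x to the table range, then interpolate linearly between breakpoints."""
    hi = lo + step * (len(table) - 1)
    x = min(max(x, lo), hi)
    pos = (x - lo) / step
    i = min(int(pos), len(table) - 2)  # index of the left breakpoint
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])


def max_abs_error(table, lo, step, x_min, x_max, samples=2001):
    """Largest absolute deviation from the reference over a dense input sweep."""
    worst = 0.0
    for k in range(samples):
        x = x_min + (x_max - x_min) * k / (samples - 1)
        worst = max(worst, abs(lut_sigmoid(x, table, lo, step) - sigmoid(x)))
    return worst


def smallest_table(lo, hi, tol, x_min, x_max, max_points=256):
    """Brute-force search for the fewest breakpoints that meet the tolerance."""
    for n in range(2, max_points + 1):
        table, step = build_lut(lo, hi, n)
        if max_abs_error(table, lo, step, x_min, x_max) <= tol:
            return table, step
    raise ValueError("no table within max_points meets the tolerance")


if __name__ == "__main__":
    # Assumed ranges and tolerance; adjust to match your own design.
    table, step = smallest_table(-8.0, 8.0, tol=0.01, x_min=-10.0, x_max=10.0)
    err = max_abs_error(table, -8.0, step, -10.0, 10.0)
    print(f"{len(table)} breakpoints, max abs error {err:.5f}")
```

The error sweep plays the role of the comparison plot: it confirms that the approximation stays within the chosen tolerance, including the inputs that are clipped to the ends of the table.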

Now, as a next step, let’s generate C code from the sigmoid function and the generated LUT and deploy it to a Cortex-M platform like the Arduino board.

We use hardware-in-the-loop simulation to run the generated code with inputs from Simulink. Running the code in this mode adds some overhead, but it still gives us a good comparison of the relative execution speed.

As you can see from the execution profile, the LUT is 2.5x faster on the Arduino. I repeated the same test on a Cortex-M7-based STMicro discovery board. Here is a plot showing the relative speedup of the lookup table with different data types.

In fact, this can scale up if you share the lookup table approximation between all neurons, further reducing the execution time by orders of magnitude. You can run the same experiment with other activation functions, such as hyperbolic tangent.
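To make the sharing idea concrete, here is a small Python sketch in which every neuron of a fully connected layer reads from one sigmoid table instead of evaluating the exponential itself. The table size, the range, and the names `SIGMOID_LUT`, `lut_sigmoid`, and `dense_layer` are hypothetical choices for this illustration. It does not represent the generated C code or the measured speedups from the video.

```python
import math

# One table, built once and shared by every neuron (assumed range and size).
LO, HI, N = -8.0, 8.0, 33
STEP = (HI - LO) / (N - 1)
SIGMOID_LUT = [1.0 / (1.0 + math.exp(-(LO + i * STEP))) for i in range(N)]


def lut_sigmoid(x):
    """Approximate sigmoid from the shared table with linear interpolation."""
    x = min(max(x, LO), HI)          # clip to the table range
    pos = (x - LO) / STEP
    i = min(int(pos), N - 2)         # index of the left breakpoint
    frac = pos - i
    return SIGMOID_LUT[i] + frac * (SIGMOID_LUT[i + 1] - SIGMOID_LUT[i])


def dense_layer(inputs, weights, biases):
    """Fully connected layer: weighted sum per neuron, then the shared LUT activation."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(lut_sigmoid(z))
    return outputs


if __name__ == "__main__":
    # Two inputs feeding two neurons; the values are arbitrary.
    print(dense_layer([0.5, -1.0], [[1.0, 2.0], [-0.5, 0.25]], [0.0, 0.5]))
```

Because the table is a single constant array, its memory cost is paid once regardless of how many neurons use it. Swapping in another activation, such as hyperbolic tangent, would only change how the table is filled.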

To learn more about optimizing LUTs in your design, refer to the additional links below the video.

Related Resources

Learn More

What Is Quantization?

Calculate Complex dB Using a Direct Lookup Table

Reducing Memory Footprint of Lookup Tables in Your Design

Convert Digit Recognition Neural Network to Fixed-Point and Generate C Code

Related Products

Fixed-Point Designer