Long Short-Term Memory (LSTM)

What Is Long Short-Term Memory (LSTM)?

Learn how LSTMs work, where to apply them, and how to design them

A long short-term memory (LSTM) network is a type of recurrent neural network (RNN). LSTMs are predominantly used to learn, process, and classify sequential data because they can learn long-term dependencies between time steps of data.

How LSTMs Work

LSTMs and RNNs

LSTM networks are a specialized form of the RNN architecture. RNNs use past information to improve the performance of a neural network on current and future inputs. They contain a hidden state and loops, which allow the network to store past information in the hidden state and operate on sequences. An RNN has two sets of weights: one for the inputs and one for the hidden state vector; the network learns both during training. At each time step, the output is based on the current input as well as on the hidden state, which encodes the previous inputs.
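
In equation form (a common formulation; W_x and W_h are the learned input and hidden-state weight matrices, b is a bias vector, and h_t is the hidden state at time step t):

    h_t = \tanh(W_x x_t + W_h h_{t-1} + b)

The output at time step t is then computed from h_t.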

In practice, simple RNNs are limited in their capacity to learn long-term dependencies. RNNs are commonly trained through backpropagation through time, during which the gradients can either vanish or explode. These problems cause the weight updates to become either vanishingly small or very large, limiting the network's effectiveness in applications that require it to learn long-term relationships.

The RNN computes a hidden state from the input, and that hidden state is fed back as an additional input to the RNN at the next time step.

Data flow at time step t for a traditional RNN.

LSTM Layer Architecture

LSTM layers use additional gates to control what information in the hidden state is passed to the output and to the next hidden state. These gates overcome the difficulty that RNNs commonly have in learning long-term dependencies. In addition to the hidden state of a traditional RNN, an LSTM block typically contains a memory cell, an input gate, an output gate, and a forget gate. The gates enable the network to learn long-term relationships in the data more effectively, and this lower sensitivity to time gaps makes LSTM networks better suited than simple RNNs for analyzing sequential data. In the figure below, you can see the LSTM architecture and data flow at time step t.

An LSTM network uses additional components, such as the forget gate and memory cell, that help prevent the vanishing and exploding gradient problems.

Data flow at time step t for an LSTM unit. The forget gate and memory cell prevent the vanishing and exploding gradient problems.

The weights and biases of the input gate control the extent to which a new value flows into the memory cell. Similarly, the weights and biases of the forget gate and output gate control the extent to which a value remains in the cell and the extent to which that value is used to compute the output activation of the LSTM block, respectively.
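
In equation form (one common formulation; for each gate, W, R, and b denote the learnable input weights, recurrent weights, and bias, \sigma is the sigmoid function, and \odot is elementwise multiplication):

    i_t = \sigma(W_i x_t + R_i h_{t-1} + b_i)    % input gate
    f_t = \sigma(W_f x_t + R_f h_{t-1} + b_f)    % forget gate
    g_t = \tanh(W_g x_t + R_g h_{t-1} + b_g)     % cell candidate
    o_t = \sigma(W_o x_t + R_o h_{t-1} + b_o)    % output gate
    c_t = f_t \odot c_{t-1} + i_t \odot g_t      % memory cell update
    h_t = o_t \odot \tanh(c_t)                   % hidden state and output

Because the memory cell c_t is updated additively, gradients can flow across many time steps without repeatedly passing through squashing nonlinearities, which is what mitigates the vanishing gradient problem.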

The following diagram illustrates the data flow through an LSTM layer with multiple time steps. The number of channels in the output matches the number of hidden units in the LSTM layer.

Diagram showing how information propagates through the multiple steps of an LSTM layer.

Data flow for an LSTM with multiple time steps. Each LSTM operation receives the hidden state and cell state from the previous operation and passes an updated hidden state and cell state to the next operation.
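
As a minimal sketch of this behavior in MATLAB (the sizes below are arbitrary placeholders; requires Deep Learning Toolbox):

    % The output of an LSTM layer has one channel per hidden unit,
    % regardless of the number of input channels.
    numFeatures = 12;        % channels in the input sequence
    numHiddenUnits = 100;    % channels in the layer output

    layers = [
        sequenceInputLayer(numFeatures)
        lstmLayer(numHiddenUnits,OutputMode="sequence")];
    net = dlnetwork(layers);

    X = dlarray(rand(numFeatures,50,"single"),"CT");   % 50 time steps
    Y = predict(net,X);
    size(Y)   % 100 channels by 50 time steps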

LSTM Network Architecture

LSTMs work well with sequence and time-series data for classification and regression tasks. LSTMs also work well on video, because a video is essentially a sequence of images. As with signals, it helps to perform feature extraction before feeding the image sequence into the LSTM layer; you can use a convolutional neural network (CNN), such as GoogLeNet, to extract features from each frame. The following figure shows how to design an LSTM network for different tasks.

Diagram of the LSTM network architecture with layers used to build an RNN for different tasks.

LSTM network architecture for classification, regression, and video classification tasks.
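
For example, a sequence-to-label classification network can be sketched in MATLAB as follows (numFeatures and numClasses are placeholders for your data):

    layers = [
        sequenceInputLayer(numFeatures)      % sequence input
        lstmLayer(128,OutputMode="last")     % keep only the final time step
        fullyConnectedLayer(numClasses)
        softmaxLayer];                       % class probabilities

For a regression task, you would replace the final two layers with a fully connected layer sized to the number of responses; for sequence-to-sequence tasks, set OutputMode="sequence".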

Bidirectional LSTM

A bidirectional LSTM (BiLSTM) learns bidirectional dependencies between time steps of time-series or sequence data. These dependencies can be useful when you want the network to learn from the complete time series at each time step. Because the input data passes through the LSTM operation twice, once in each direction, a BiLSTM can extract more information from each sequence, which can increase the performance of your network.

A BiLSTM consists of two LSTM components: the forward LSTM and the backward LSTM. The forward LSTM operates from the first time step to the last time step. The backward LSTM operates from the last time step to the first time step. After passing the data through the two LSTM components, the operation concatenates the outputs along the channel dimension.

Diagram showing the forward and backward LSTM operations in a BiLSTM.

Architecture of a BiLSTM with multiple time steps.
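
In MATLAB, a BiLSTM layer is a drop-in replacement for an LSTM layer. Note that its output has twice as many channels as hidden units, because the forward and backward outputs are concatenated (a sketch with placeholder sizes):

    layers = [
        sequenceInputLayer(numFeatures)
        bilstmLayer(100,OutputMode="last")   % output has 2*100 = 200 channels
        fullyConnectedLayer(numClasses)
        softmaxLayer];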

LSTM Applications

LSTMs are particularly effective for working with sequential data, which can vary in length, and learning long-term dependencies between time steps of that data. Common LSTM applications include sentiment analysis, language modeling, speech recognition, and video analysis.

Broad LSTM Applications

LSTMs and other RNNs are a key technology in applications such as:

  • Signal processing. Signals are naturally sequential data, as they are often collected from sensors over time. Automatic classification and regression on large signal data sets allow prediction in real time. Raw signal data can be fed into deep networks or preprocessed to focus on specific features, such as frequency components. Feature extraction can greatly improve network performance.
  • Natural language processing (NLP). Language is naturally sequential, and pieces of text vary in length. LSTMs are a great tool for natural language processing tasks, such as text classification, text generation, machine translation, and sentiment analysis, because they can learn to contextualize words in a sentence.

MATLAB provides examples to help you start applying LSTMs to signal processing and natural language processing tasks.

Vertical LSTM Applications

Using LSTM Networks to Estimate NOx Emissions

Renault engineers used LSTMs in developing next-generation technology for zero-emissions vehicles (ZEVs).

They obtained their training data from tests conducted on an actual engine. During these tests, the engine was put through common drive cycles. The captured data, which included engine torque, engine speed, coolant temperature, and gear number, together with the measured NOx emissions, provided the input and target data for the LSTM network. After several iterations on the design of the LSTM architecture, the final version achieved 85–90% accuracy in predicting NOx levels.

LSTMs with MATLAB

Using MATLAB® with Deep Learning Toolbox™ enables you to design, train, and deploy LSTMs. Using Text Analytics Toolbox™ or Signal Processing Toolbox™ allows you to apply LSTMs to text or signal analysis.

Design and Train Networks

You can design and train LSTMs programmatically with a few lines of code. Use LSTM layers, bidirectional LSTM layers, and LSTM projected layers to build LSTMs. You can also design, analyze, and modify LSTMs interactively using the Deep Network Designer app.
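
For example, a classification network such as the ones sketched above can be trained in a few lines (assuming XTrain is a cell array of sequences and TTrain is a categorical vector of labels; in releases without the trainnet function, use trainNetwork instead):

    options = trainingOptions("adam", ...
        MaxEpochs=30, ...
        MiniBatchSize=64, ...
        Plots="training-progress");
    net = trainnet(XTrain,TTrain,layers,"crossentropy",options);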

Screenshot of a simple BiLSTM network, interactively built with the Deep Network Designer app.

Using the Deep Network Designer app for interactively building, visualizing, and editing LSTM networks.

Import and Export Networks

You can exchange LSTM networks with Python®-based deep learning frameworks:

  • Import PyTorch®, TensorFlow™, and ONNX™ models with one line of code.
  • Interactively import PyTorch and TensorFlow models with Deep Network Designer.
  • Export LSTM networks to TensorFlow and ONNX with one line of code.

Diagram showing interoperability for LSTMs and other deep neural networks between MATLAB, TensorFlow, ONNX, and PyTorch.

Converting LSTM networks between MATLAB, TensorFlow, ONNX, and PyTorch.
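
For example (a sketch; the model names are hypothetical, and each function requires the corresponding interoperability support package):

    net = importNetworkFromTensorFlow("myTFModel");    % TensorFlow SavedModel folder
    net = importNetworkFromPyTorch("myModel.pt");      % traced PyTorch model
    net = importNetworkFromONNX("myModel.onnx");
    exportNetworkToTensorFlow(net,"myExportedModel");
    exportONNXNetwork(net,"myModel.onnx");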

Deploy Networks

Deploy your trained LSTM on embedded systems, enterprise systems, or the cloud:

  • Automatically generate optimized C/C++ code and CUDA code for deployment to CPUs and GPUs.
  • Generate synthesizable Verilog® and VHDL® code for deployment to FPGAs and SoCs.

Diagram showing MATLAB and Simulink code generation for deploying deep neural networks to CPUs, GPUs, microcontrollers, and FPGAs.

Quickly deploy trained deep learning networks to production.
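
As a sketch of the code generation workflow (the file name lstmNet.mat and the input size are hypothetical; generating CUDA code requires GPU Coder™):

    function Y = lstmPredict(X)  %#codegen
    % Entry-point function: load the trained network once, then predict.
    persistent net
    if isempty(net)
        net = coder.loadDeepLearningNetwork("lstmNet.mat");
    end
    Y = predict(net,X);
    end

You can then generate, for example, a CUDA MEX function with:

    cfg = coder.gpuConfig("mex");
    codegen -config cfg lstmPredict -args {ones(12,50,"single")} -report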