Compressing Neural Networks Using Network Projection
By Antoni Woss, MathWorks
Deploying increasingly large and complex deep learning networks onto resource-constrained devices is a growing challenge facing many AI practitioners. Modern deep neural networks typically run on high-performance processors and GPUs and feature many millions of learnable parameters. Deploying these powerful models onto edge devices often requires compressing them to reduce their size on disk, runtime memory footprint, and inference time, while attempting to retain high accuracy and model expressivity.
There are numerous techniques for compressing deep learning networks—such as pruning and quantization—that can be used in tandem. This article introduces a new technique: network projection. Network projection analyzes the covariance of neural excitations on layers of interest and reduces the number of learnable parameters by modifying those layers to operate in a projective space. Although the operations of the layer take place in a typically lower-rank projective space, the expressivity of the layer remains high because the width—i.e., the number of neural activations—remains unchanged compared to the original network architecture. Projection can be used in place of, or in addition to, pruning and quantization.
Network Compression via Projection
Training a deep neural network amounts to determining a point in the corresponding high-dimensional space of learnable parameters that minimizes some specified loss function. The most common approaches to finding this optimal point are gradient descent minimizations and variations thereof. To compensate for a lack of knowledge of the underlying function being approximated, deep network architectures with an abundance of learnable parameters are used to provide the expressivity needed to find a suitable function. Such architectures can ultimately be vastly over-parameterized for the task at hand, which typically results in a high degree of correlation between neurons in the network.
Network projection approaches the problem of compression by analyzing these neural correlations before carefully introducing projective operations that ultimately reduce the number of learnable parameters in the network, while retaining the important neural relations to preserve accuracy and expressivity.
Background
To set up the discussion, begin by defining a simple feedforward network in which the activations of layer $i+1$ are computed from those of layer $i$ as

$$x_{i+1} = \sigma_i\!\left(W_i x_i + b_i\right).$$

Here, $x_i$ is the vector of neural activations on layer $i$, $W_i$ is the layer's weight matrix, $b_i$ is its bias vector, and $\sigma_i$ is its activation function. A neuron is defined as an elementwise component of the activation vector $x_i$.
Neural Covariance
As a deep learning network trains, its learnable parameters move around the high-dimensional space in a correlated way, with a trajectory determined by the choice of minimization routine, the network initialization, and the training data. The neurons in the resulting trained network can therefore exhibit a high degree of correlation. You can measure the covariance between the neurons on each layer by stimulating the neurons with data representative of the true data distribution the network is trained on, and then computing the neural covariance matrix with respect to these neural excitations.
To determine the neural covariance matrix for layer $i$, pass $N$ input samples drawn from this representative data through the network and record the corresponding activations $x_i^{(n)}$ for $n = 1, \ldots, N$. The neural covariance matrix is then

$$C_i = \frac{1}{N} \sum_{n=1}^{N} \left(x_i^{(n)} - \bar{x}_i\right)\left(x_i^{(n)} - \bar{x}_i\right)^{\mathsf{T}},$$

where $\bar{x}_i = \frac{1}{N}\sum_{n=1}^{N} x_i^{(n)}$ is the mean activation of layer $i$ over the samples.
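As a concrete illustration, the following minimal MATLAB sketch (not the toolbox implementation) estimates the neural covariance matrix of a single layer from a matrix X whose columns are the recorded activations of that layer, one column per sample:

% X is an n-by-N matrix: n neurons on the layer, N representative samples.
N = size(X, 2);
xbar = mean(X, 2);                    % mean activation over the samples
C = (X - xbar) * (X - xbar)' / N;     % n-by-n neural covariance matrix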
The covariance matrix is symmetric and positive semi-definite, so it has real, non-negative eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$ and a corresponding orthonormal set of eigenvectors $u_1, u_2, \ldots, u_n$. Each eigenvector defines a linear combination of the layer's neurons—an eigenneuron—and its eigenvalue measures how much of the variance in the neural excitations that eigenneuron captures.
This analysis of covariance motivates the construction of a projection operator,

$$P_i = Q_i Q_i^{\mathsf{T}}, \qquad Q_i = \left[\, u_1 \;\; u_2 \;\; \cdots \;\; u_k \,\right].$$

Here, eigenvectors are ordered with respect to decreasing eigenvalues, so $P_i$ projects the neural activations of layer $i$ onto the subspace spanned by the $k$ highest-variance eigenneurons. The rank $k$ can be chosen by fixing the explained variance,

$$\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j},$$

where $n$ is the number of neurons on the layer; keeping this fraction close to one while taking $k < n$ discards only the low-variance eigenneurons.
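Continuing the sketch above, the projection operator for a layer can be formed from the leading eigenvectors of C; the 95% explained-variance threshold below is an illustrative choice rather than a prescribed value:

[U, lambda] = eig(C, 'vector');                      % eigenvectors and eigenvalues of C
[lambda, order] = sort(lambda, 'descend');           % order by decreasing eigenvalue
U = U(:, order);
k = find(cumsum(lambda)/sum(lambda) >= 0.95, 1);     % smallest rank retaining 95% of the variance
Q = U(:, 1:k);                                       % top-k eigenvectors (eigenneurons)
P = Q * Q';                                          % rank-k projection operator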
The Projection Framework
The projection operation applied to the neurons of a layer defines a projected layer. More formally, the projection operation can be used to define a mapping from a layer to its projected equivalent,

$$L_i(x_i) = \sigma_i\!\left(W_i x_i + b_i\right) \;\longrightarrow\; L_i^{P}(x_i) = P_{i+1}\,\sigma_i\!\left(W_i P_i x_i + b_i\right),$$

where the input and output neurons are projected by the operators $P_i$ and $P_{i+1}$, constructed from the covariance of the layer's input and output activations, respectively. This mapping extends to the projection of a network. A projected network is the composition of projected layers:

$$\mathcal{N}^{P} = L_m^{P} \circ L_{m-1}^{P} \circ \cdots \circ L_1^{P}.$$
Applying the projection operator layer-by-layer simply amounts to transforming the neurons in each layer to the corresponding eigenneurons, reducing the rank by eliminating low-variance eigenneurons, then transforming back into the original basis of neurons before feeding into the next layer (Figure 2).
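For a single hypothetical fully connected layer with weight matrix W, bias b, input activation vector x, and a tanh activation, this layer-by-layer procedure can be sketched in MATLAB as follows, where Qin and Qout are the eigenvector matrices built from the covariance of the layer's input and output activations:

zin = Qin' * x;               % transform input neurons to their top-k eigenneurons
xP  = Qin * zin;              % transform back to the original neuron basis
y   = tanh(W * xP + b);       % layer operation on the projected input
yP  = Qout * (Qout' * y);     % project the output neurons before feeding the next layer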
In this framework, different layers can be projected by varying amounts, depending on the rank of the layer's input and output projection operators. Furthermore, any layer can be swapped for its projected equivalent without the need to change the downstream architecture, making this projective technique applicable to almost any network architecture, not just feedforward networks.
Compression
For many layers with learnable parameters—such as fully connected, convolution, and LSTM—the layer operations can be modified to absorb the projection operation into the learnable parameters. This ultimately reduces the number of learnable parameters on the projected layer with respect to the original, thus compressing the layer (Figure 3).
For the fully connected layer in Figure 1 and the projected fully connected layer in Figure 3—assuming for simplicity that there is no bias term—each edge represents a learnable parameter, that is, an element of the weight matrix. In Figure 1, there are six learnable parameters in the weight matrix, whereas the projected layer in Figure 3 has five. This reduces the size of the layer, trading one large matrix multiplication for two smaller ones.
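A minimal sketch of this factorization, assuming a hypothetical fully connected layer with two inputs, three outputs, and a rank-1 input projector, reproduces the six-versus-five parameter count described above:

W  = randn(3, 2);             % original weights: 3 x 2 = 6 learnable parameters
Q  = orth(randn(2, 1));       % rank-1 input projector from the covariance analysis
Wp = W * Q;                   % absorbed weights: 3 x 1 = 3 learnable parameters
x  = randn(2, 1);
yProjected = Wp * (Q' * x);   % equals W * (Q * Q') * x
% The projected layer stores Wp (3 parameters) and Q (2 parameters), 5 in total,
% trading one 3-by-2 multiplication for a 1-by-2 and then a 3-by-1 multiplication.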
It is important to note that the distribution of the training data drives the construction of the projection operation, which, in turn, drives the decomposition of the weight matrices. This contrasts with simply taking a singular value decomposition (SVD) of the weight matrices themselves, as the neural covariance matrix may favor a different subspace from the one spanned by the leading SVD factors.
The precise implementation of projection depends on the layer itself. Furthermore, the number of neurons on each layer—the width of the network—is unchanged by projection; only the learnable parameters and the internal operations of the layer are modified, so the activations passed between layers keep their original sizes.
Fine-Tuning
Starting with a pretrained network, the projected network equivalent can be computed directly as outlined in the framework above. Although a high degree of data covariance is retained, the network may need fine-tuning, as accuracy can drop following the compression. This could be the result of higher-order nonlinear relations between neurons needing to readjust after the linear transformation, or of a compounding effect when many layers are projected at once. The projected model provides a good initialization for fine-tuning by further retraining. This retraining typically takes far fewer epochs for the accuracy to plateau and can, in many cases, largely recover the accuracy of the original network.
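A minimal fine-tuning sketch, assuming a dlnetwork-based regression workflow in which the projected network netProjected and a training datastore dsTrain already exist; the trainnet function used here is available in newer MATLAB releases, and earlier releases can use trainNetwork or a custom training loop instead:

options = trainingOptions("adam", ...
    MaxEpochs=5, ...                  % fine-tuning typically plateaus after a few epochs
    InitialLearnRate=1e-4, ...        % small learning rate to refine rather than retrain
    Shuffle="every-epoch");
netFineTuned = trainnet(dsTrain, netProjected, "mse", options);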
Examples and More
One illustration of where compression using projection has been successfully applied is in building a virtual sensor for battery state-of-charge (BSOC) estimation.
For a variety of reasons, it is not physically possible to build a deployable sensor that measures BSOC directly. This is a common challenge for many systems across industries, not only for batteries. A tempting choice is to estimate BSOC using an extended Kalman filter (EKF). An EKF can be extremely accurate, but it requires a nonlinear mathematical model of the battery, which is not always available or feasible to create. A Kalman filter can also be computationally expensive and, if the initial state is wrong or the model is not correct, it can produce inaccurate results.
An alternative is to develop a virtual sensor using a recurrent neural network (RNN) featuring LSTM layers. Such models have been shown to produce highly accurate results when trained on quality data; however, they are often large and slow at inference. The LSTM layers in the network can be compressed using projection to reduce the overall network size and improve inference speed, while retaining high accuracy on the data set. A typical architecture for a projected LSTM network used for BSOC estimation is shown in Table 1.
# | Name | Type | Activations | Learnable Properties
1 | sequenceinput (sequence input with three dimensions) | Sequence Input | 3(C) x 1(B) x 1(T) | -
2 | lstm_1 (projected LSTM layer with 256 hidden units) | Projected LSTM | 256(C) x 1(B) x 1(T) | InputWeights 1024 x 3, RecurrentWeights 1024 x 11, Bias 1024 x 1, InputProjector 3 x 3, OutputProjector 256 x 11
3 | dropout_1 (20% dropout) | Dropout | 256(C) x 1(B) x 1(T) | -
4 | lstm_2 (projected LSTM layer with 128 hidden units) | Projected LSTM | 128(C) x 1(B) x 1(T) | InputWeights 512 x 11, RecurrentWeights 512 x 8, Bias 512 x 1, InputProjector 256 x 11, OutputProjector 128 x 8
5 | dropout_2 (20% dropout) | Dropout | 128(C) x 1(B) x 1(T) | -
6 | fc (one fully connected layer) | Fully Connected | 1(C) x 1(B) x 1(T) | Weights 1 x 128, Bias 1 x 1
7 | layer (sigmoid) | Sigmoid | 1(C) x 1(B) x 1(T) | -
8 | regressionoutput (mean-squared-error with response ‘Response’) | Regression Output | 1(C) x 1(B) x 1(T) | -
Table 1. Analysis of a projected LSTM network architecture used in BSOC estimation.
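For reference, the uncompressed counterpart of the architecture in Table 1 can be written as a MATLAB layer array along the following lines; layer names and sizes mirror the table, and the projected LSTM layers are produced later by compression rather than defined directly:

layers = [
    sequenceInputLayer(3, Name="sequenceinput")
    lstmLayer(256, Name="lstm_1")
    dropoutLayer(0.2, Name="dropout_1")
    lstmLayer(128, Name="lstm_2")
    dropoutLayer(0.2, Name="dropout_2")
    fullyConnectedLayer(1, Name="fc")
    sigmoidLayer(Name="layer")
    regressionLayer(Name="regressionoutput")
    ];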
Figure 4 shows a comparison of model accuracy (measured as RMSE), model size (number of learnable parameters), and inference speed (running as a MEX file in MATLAB) for this BSOC RNN with LSTM layers, before and after projection and fine-tuning.
To project and compress networks of your own, try out the compressNetworkUsingProjection and neuronPCA functions, available from MATLAB R2022b onward, by installing Deep Learning Toolbox™ together with the Deep Learning Toolbox Model Quantization Library.
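A minimal usage sketch, assuming net is a trained dlnetwork and mbq is a minibatchqueue of representative training data; the explained-variance option shown is illustrative, so check the documentation of your release for the exact name-value arguments:

npca = neuronPCA(net, mbq);                        % analyze neural covariance on supported layers
netProjected = compressNetworkUsingProjection(net, npca, ...
    ExplainedVarianceGoal=0.95);                   % retain 95% of the neural variance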
Published 2023