Levenberg Marquardt Implementation Issues

14 views (last 30 days)
EDIT: Can you at least suggest me where to find the specific matlab function that implements the LM algorithm?
Hi all,
for personal research reasons I'm trying to implement from scratch the training algorithms featured in deep learning toolbox. Everything went well with first order algorithms such as standard backpropagation gradient descent. My custom code converges to the exact same weights/biases than the matlab's one and it's also twice faster. I suppose that the higher speed is due to the greater number of if/else instructions matlab needs to do compared with my code that is suited only for one layered networks with tanh activations.
But I'm having problems with the Levenberg Marquardt. I wrote 3 functions:
  1. Jacobian_LM
  2. UpdatesThroughHessianAndGradient
  3. Levenberg_Marquardt_Backpropagation
The first one just computes the "per-example" Jacobian of the network WRT weights and biases through the standard backpropagation algorithm. It means that it outputs matrices in which every column is a gradient relative to a single input/output couple.
The second one flattens the Jacobians to a single long vector and computes the hessian matrix H and the gradient D that are needed for the LM update rule:
Then it reshapes and splits again the updates into the specific weights and biases matrices of the network with the correct size and shape.
Finally the third one is an implementation of the LM loop that at every iteration updates weights and biases of a standard matlab network object (with one hidden layer and tanh activation) until a performance improvement is achieved through an increase/decrase pattern of mu.
Here is the functions you need to try my algorithm:
function [Net] = Levenberg_Marquardt_Backpropagation(Net,x,y,mu,mu_increase_rate,mu_decrease_rate,max_mu,iterations)
%UNTITLED Summary of this function goes here
% Detailed explanation goes here
%NETWORK WEIGHTS\BIASES STORAGE
IW = Net.IW{1,1};
Ib = Net.b{1,1};
LW = Net.LW{2,1};
Lb = Net.b{2,1};
%%%%%%%%%%%%%%%%%
for i = 1:iterations
while mu <= max_mu && mu > 1e-20
%PREV-PERFORMANCE COMPUTATION
Pred = LW*tansig(IW*x + Ib) + Lb;
Prev_Perf = mean((y-Pred).^2);
%%%%%%%%%%%%%%%%%%%%%%%%
%PREVIOUS WEIGHTS\BIASES STORAGE
Prev_IW = IW;
Prev_Ib = Ib;
Prev_LW = LW;
Prev_Lb = Lb;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%GRADIENT\HESSIAN COMPUTATION
[IWJ,IbJ,LWJ,LbJ] = Jacobian_LM(IW,LW,Ib,Lb,x,y);
[IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%WEIGHTS\BIASES UPDATE
IW = IW + IWUpdate;
Ib = Ib + IbUpdate;
LW = LW + LWUpdate';
Lb = Lb + LbUpdate;
%%%%%%%%%%%
%PERFORMANCE COMPUTATION
Pred = LW*tansig(IW*x + Ib) + Lb;
Perf = mean((y-Pred).^2);
%%%%%%%%%%%%%%%%%%%%%%%%
%PERFORMANCE CHECK
if(Perf >= Prev_Perf)
IW = Prev_IW;
Ib = Prev_Ib;
LW = Prev_LW;
Lb = Prev_Lb;
mu = mu*mu_increase_rate;
else
mu = mu*mu_decrease_rate;
break;
end
%%%%%%%%%%%%%%%%%%
end
end
%FINAL NET UPDATE
Net.IW{1,1} = IW;
Net.LW{2,1} = LW;
Net.b{1,1} = Ib;
Net.b{2,1} = Lb;
%%%%%%%%%%%
end
function [IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu)
%UNTITLED Summary of this function goes here
% Detailed explanation goes here
s1 = size(IWJ,1);
s2 = size(IWJ,2);
s3 = size(IbJ,1);
s4 = size(LWJ,1);
s5 = size(LbJ,1);
s6 = size(IWJ,3);
Jac = nan(s1*s2 + s3 + s4 + s5,s6);
Jac(1:s1*s2,:) = reshape(IWJ,s1*s2,s6);
Jac(s1*s2+1:s1*s2+s3,:) = IbJ;
Jac(s1*s2+s3+1:s1*s2+s3+s4,:) = LWJ;
Jac(s1*s2+s3+s4+1:s1*s2+s3+s4+s5,:) = LbJ;
H = (Jac*Jac')/s6;
D = mean(Jac.*(Pred - y),2);
Update_Tot = -pinv(H + mu*eye(size(H,1)), min(H(:))/1000)*D;
IWUpdate = reshape(Update_Tot(1:s1*s2),s1,s2);
IbUpdate = Update_Tot(s1*s2+1:s1*s2+s3);
LWUpdate = Update_Tot(s1*s2+s3+1:s1*s2+s3+s4);
LbUpdate = Update_Tot(s1*s2+s3+s4+1:s1*s2+s3+s4+s5);
end
function [IWJ,IbJ,LWJ,LbJ] = Jacobian_LM(IW,LW,Ib,Lb,x,y) %#ok<INUSL>
%UNTITLED Summary of this function goes here
% Detailed explanation goes here
%FORWARD PASS
zI = IW*x + Ib;
aI = tansig(zI);
%zL = LW*aI + Lb;
%aL = zL;
%%%%%%%%%%%%%%
%BACKPROPAGATION
deltaLW = ones(1,size(y,2));
deltaIW = (1 - aI.^2).*LW'.*deltaLW;
%%%%%%%%%%%%%%%%
%JACOBIAN COMPUTATION
IWJ = nan(size(deltaIW,1),size(x,1),size(x,2));
for i = 1:size(x,2)
IWJ(:,:,i) = deltaIW(:,i).*x(:,i)';
end
IbJ = deltaIW;
LWJ = aI.*deltaLW;
LbJ = deltaLW;
%%%%%%%%%%%%%%%%%%%%%
end
And here's a script that you can just copy/paste to test the code (if you have the file above in your working directory of course):
rng(0)
clear
format long
warning('off')
%DEFINE A SIMPLE PROBLEM
x = rand(2,1000)*10;
x = x-5;
y = x(1,:).^2 + x(2,:).^2;
x = x/10;
%%%%%%%%%%%%%%%%%%%%%%%%
%DEFINE TRAINING PARAMETERS
disp(" ")
disp(" ")
Initial_Mu = 0.001;
Incr_Rate = 10;
Decr_Rate = 0.1;
Max_Mu = 1e10;
Epochs = 1000;
Hidden_Neurons = 20;
disp(['Initial_Mu: ',num2str(Initial_Mu)]);
disp(['Incr_Rate: ',num2str(Incr_Rate)]);
disp(['Decr_Rate: ',num2str(Decr_Rate)]);
disp(['Hidden_Neurons: ',num2str(Hidden_Neurons)]);
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%DEFINE AND INITIALIZE A NETWORK
net = feedforwardnet(Hidden_Neurons);
net.trainParam.epochs = Epochs;
net.trainParam.mu_dec = Decr_Rate;
net.trainParam.mu_inc = Incr_Rate;
net.trainParam.mu = Initial_Mu;
net.trainParam.mu_max = Max_Mu;
%net.trainParam.showWindow = false;
net.inputs{1,1}.processFcns = {};
net.outputs{1,2}.processFcns = {};
net.trainParam.min_grad = 1e-25;
net.trainParam.max_fail = 50;
net.divideParam.trainRatio = 1;
net.divideParam.valRatio = 0;
net.divideParam.testRatio = 0;
net = configure(net,x,y);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%USE THAT INITIALIZED NETWORK FOR TRAINING WITH MATLAB'S TRAINLM
disp(" ")
disp(" ")
netMATLAB = net;
tic
netMATLAB = train(netMATLAB,x,y);
disp("Matlab time:")
toc
disp(" ")
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%USE THE SAME INITIALIZED NETWORK FOR CUSTOM TRAINING
netCUSTOM = net;
tic
[netCUSTOM] = Levenberg_Marquardt_Backpropagation(netCUSTOM,x,y,Initial_Mu,Incr_Rate,Decr_Rate,Max_Mu,Epochs);
disp("Custom time:")
toc
disp(" ")
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%COMPARE RESULTS
Pred_Matlab = netMATLAB(x);
Pred_Custom = netCUSTOM(x);
disp("Absolute difference between Matlab and Custom outputs");
disp(mean(abs(Pred_Matlab - Pred_Custom)));
disp("Matlab MAE")
disp(mean(abs(Pred_Matlab - y)))
disp("Custom MAE")
disp(mean(abs(Pred_Custom - y)))
%%%%%%%%%%%%%%%%
This script compares both execution time and convergence.
When you run the script above you get results like following.
Here's some examples with different settings:
With 3 neurons the 2 algorithms (trainlm and my custom LM) converges to similar (yet not identical) networks:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 3
Matlab time:
Elapsed time is 0.793722 seconds.
Custom time:
Elapsed time is 1.711315 seconds.
Absolute difference between Matlab and Custom outputs
7.440071774738044e-07
Matlab MAE
6.179026991132736
Custom MAE
6.179027091263226
With 5 neurons the results are very different. For some reason my algorithm converges to a better network:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 5
Matlab time:
Elapsed time is 0.883287 seconds.
Custom time:
Elapsed time is 1.980380 seconds.
Absolute difference between Matlab and Custom outputs
0.007647618697004
Matlab MAE
0.013716265278847
Custom MAE
0.006069555043366
With 10 neurons the networks are pretty much the same:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 10
Matlab time:
Elapsed time is 1.011172 seconds.
Custom time:
Elapsed time is 2.766820 seconds.
Absolute difference between Matlab and Custom outputs
1.992397943695323e-08
Matlab MAE
0.030143779923111
Custom MAE
0.030143774946144
With 50 neurons the results are still very similar:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 50
Matlab time:
Elapsed time is 2.985799 seconds.
Custom time:
Elapsed time is 7.052290 seconds.
Absolute difference between Matlab and Custom outputs
8.268088436125254e-13
Matlab MAE
2.147009752994717e-04
Custom MAE
2.147009753220598e-04
With 100 neurons, again, the results are close:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 100
Matlab time:
Elapsed time is 8.071735 seconds.
Custom time:
Elapsed time is 17.729801 seconds.
Absolute difference between Matlab and Custom outputs
3.318731955914700e-13
Matlab MAE
6.768720617779367e-05
Custom MAE
6.768720614557056e-05
Finally with 200 neurons there's a significant difference:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 200
Matlab time:
Elapsed time is 23.504194 seconds.
Custom time:
Elapsed time is 49.679209 seconds.
Absolute difference between Matlab and Custom outputs
1.711683279275178e-07
Matlab MAE
2.712217358494099e-04
Custom MAE
2.712337472282703e-04
BUT, if you force mu to be constant by settings the increase and decrease rate to 1, the results becomes close even with 200 neurons:
Initial_Mu: 0.001
Incr_Rate: 1
Decr_Rate: 1
Hidden_Neurons: 200
Matlab time:
Elapsed time is 16.398573 seconds.
Custom time:
Elapsed time is 24.675034 seconds.
Absolute difference between Matlab and Custom outputs
7.406640634144424e-12
Matlab MAE
0.020155109646983
Custom MAE
0.020155109647200
Observations and questions:
  1. My algorithm is way slower than trainlm. It's about 2 times slower. There's some clear bottleneck that I'm missing in my code?
  2. It seems there's some implementation difference between trainlm and my algo. They're definitely not the same. Similar but not the same.
  3. My first guess is that matlab uses some clever way to invert the hessian matrix. I just used the "inv" function and disabled all the warnings. How matlab invert the hessian matrix exactly? Does it use the Penrose pseudoinverse with some smart tolerance settings?
  4. Considering that with a costant mu the two algos produce very similar outputs my second guess is that the difference is in the mu update pattern, probably I use a slightly different way to update the mu parameter. At every iteration I just keep updating mu with a while loop until a performance improvement is achieved. How exactly matlab updates the parameter mu through the iterations?
  5. Does matlab use some implicit minimal mu to limit the mu updates like the max_mu parameter?
I apologize for the length of this post and thank you in advance

Accepted Answer

Matt J
Matt J on 23 Dec 2019
Edited: Matt J on 23 Dec 2019
My first guess is that matlab uses some clever way to invert the hessian matrix. I just used the "inv" function and disabled all the warnings. How matlab invert the hessian matrix exactly?
It probably doesn't invert the Hessian at all. A complete matrix inversion is a very inefficient way to solve a set of linear equations. More likely, it uses mldivide() or linsolve(). Compare:
>> N=6000; A=rand(N); b=rand(N,1);
>> tic; A\b; toc
Elapsed time is 2.510903 seconds.
>> tic; inv(A)*b; toc
Elapsed time is 10.053053 seconds.
EDIT:
At every iteration I just keep reducing mu with a while loop until a performance improvement is achieved. How exactly matlab updates the parameter mu through the iterations?
I don't think you want to be reducing mu in pursuit of better descent. That will make the inversion more and more singular. You want to be increasing mu instead. The mechanism by which trainlm does this is described in the trainlm documentation, in particular
"The adaptive value mu is increased by mu_inc until the change above results in a reduced performance value. The change is then made to the network and mu is decreased by mu_dec."
  13 Comments
Matt J
Matt J on 24 Dec 2019
Edited: Matt J on 24 Dec 2019
OK. Well, the loop-free method to compute J would be (using the same toy example),
deltaIW =[
3 3 9 1 3
8 8 4 5 4
4 7 9 7 4
7 2 4 9 3];
x =[
10 6 2 4 3
2 1 6 9 8];
[Nneurons,~]=size(deltaIW);
[Nfeatures,Nexamples]=size(x);
J=reshape(deltaIW,Nneurons,1,[]).*reshape(x,1,Nfeatures,[]);
J=reshape(J,[],Nexamples),
J =
30 18 18 4 9
80 48 8 20 12
40 42 18 28 12
70 12 8 36 9
6 3 54 9 24
16 8 24 45 32
8 7 54 63 32
14 2 24 81 24
Roberto
Roberto on 24 Dec 2019
Brilliant! Your idea works greatly and now we're very close to the performance of trainlm.
I'll perform more tests tomorrow and will keep you updated.
For now, thank you and merry christmas!

Sign in to comment.

More Answers (2)

Matt J
Matt J on 23 Dec 2019
Edited: Matt J on 23 Dec 2019
I wonder if you have also considered using the implementation of Levenberg-Marquardt already available through lsqnonlin. By using the SpecifyObjectiveGradient option, you could still profit from your customized Jacobian calculation, but also avoid the effort of reimplementing/debugging the framework of the algorithm responsible for the iteration updates.
  1 Comment
Roberto
Roberto on 23 Dec 2019
No, i didn't. It's a very good idea, thank you.
This will be my next try if I don't manage to improve my code.
At this point it's clear to me that my algo and trainlm do the same things. The outputs of the 2 algos are the same 'til 10 significant digits or more. I jus need to understand why mine is not fully efficient. I'm close!!

Sign in to comment.


AHMED FAKHRI
AHMED FAKHRI on 1 Dec 2020
Hi@Matt J
I have a question please regarding Levenberg-Marquardt derivation. I could not understand how in the below picture, the elements of the gradient step in which G (fi(m)^2) became : 2fi(m)*G(fi(m)j. Can you please help me with this? Thanks

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!