Levenberg Marquardt Implementation Issues

EDIT: Can you at least suggest me where to find the specific matlab function that implements the LM algorithm?
Hi all,
for personal research reasons I'm trying to implement from scratch the training algorithms featured in deep learning toolbox. Everything went well with first order algorithms such as standard backpropagation gradient descent. My custom code converges to the exact same weights/biases than the matlab's one and it's also twice faster. I suppose that the higher speed is due to the greater number of if/else instructions matlab needs to do compared with my code that is suited only for one layered networks with tanh activations.
But I'm having problems with the Levenberg Marquardt. I wrote 3 functions:
  1. Jacobian_LM
  2. UpdatesThroughHessianAndGradient
  3. Levenberg_Marquardt_Backpropagation
The first one just computes the "per-example" Jacobian of the network WRT weights and biases through the standard backpropagation algorithm. It means that it outputs matrices in which every column is a gradient relative to a single input/output couple.
The second one flattens the Jacobians to a single long vector and computes the hessian matrix H and the gradient D that are needed for the LM update rule:
Then it reshapes and splits again the updates into the specific weights and biases matrices of the network with the correct size and shape.
Finally the third one is an implementation of the LM loop that at every iteration updates weights and biases of a standard matlab network object (with one hidden layer and tanh activation) until a performance improvement is achieved through an increase/decrase pattern of mu.
Here is the functions you need to try my algorithm:
function [Net] = Levenberg_Marquardt_Backpropagation(Net,x,y,mu,mu_increase_rate,mu_decrease_rate,max_mu,iterations)
%UNTITLED Summary of this function goes here
% Detailed explanation goes here
%NETWORK WEIGHTS\BIASES STORAGE
IW = Net.IW{1,1};
Ib = Net.b{1,1};
LW = Net.LW{2,1};
Lb = Net.b{2,1};
%%%%%%%%%%%%%%%%%
for i = 1:iterations
while mu <= max_mu && mu > 1e-20
%PREV-PERFORMANCE COMPUTATION
Pred = LW*tansig(IW*x + Ib) + Lb;
Prev_Perf = mean((y-Pred).^2);
%%%%%%%%%%%%%%%%%%%%%%%%
%PREVIOUS WEIGHTS\BIASES STORAGE
Prev_IW = IW;
Prev_Ib = Ib;
Prev_LW = LW;
Prev_Lb = Lb;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%GRADIENT\HESSIAN COMPUTATION
[IWJ,IbJ,LWJ,LbJ] = Jacobian_LM(IW,LW,Ib,Lb,x,y);
[IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%WEIGHTS\BIASES UPDATE
IW = IW + IWUpdate;
Ib = Ib + IbUpdate;
LW = LW + LWUpdate';
Lb = Lb + LbUpdate;
%%%%%%%%%%%
%PERFORMANCE COMPUTATION
Pred = LW*tansig(IW*x + Ib) + Lb;
Perf = mean((y-Pred).^2);
%%%%%%%%%%%%%%%%%%%%%%%%
%PERFORMANCE CHECK
if(Perf >= Prev_Perf)
IW = Prev_IW;
Ib = Prev_Ib;
LW = Prev_LW;
Lb = Prev_Lb;
mu = mu*mu_increase_rate;
else
mu = mu*mu_decrease_rate;
break;
end
%%%%%%%%%%%%%%%%%%
end
end
%FINAL NET UPDATE
Net.IW{1,1} = IW;
Net.LW{2,1} = LW;
Net.b{1,1} = Ib;
Net.b{2,1} = Lb;
%%%%%%%%%%%
end
function [IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu)
%UNTITLED Summary of this function goes here
% Detailed explanation goes here
s1 = size(IWJ,1);
s2 = size(IWJ,2);
s3 = size(IbJ,1);
s4 = size(LWJ,1);
s5 = size(LbJ,1);
s6 = size(IWJ,3);
Jac = nan(s1*s2 + s3 + s4 + s5,s6);
Jac(1:s1*s2,:) = reshape(IWJ,s1*s2,s6);
Jac(s1*s2+1:s1*s2+s3,:) = IbJ;
Jac(s1*s2+s3+1:s1*s2+s3+s4,:) = LWJ;
Jac(s1*s2+s3+s4+1:s1*s2+s3+s4+s5,:) = LbJ;
H = (Jac*Jac')/s6;
D = mean(Jac.*(Pred - y),2);
Update_Tot = -pinv(H + mu*eye(size(H,1)), min(H(:))/1000)*D;
IWUpdate = reshape(Update_Tot(1:s1*s2),s1,s2);
IbUpdate = Update_Tot(s1*s2+1:s1*s2+s3);
LWUpdate = Update_Tot(s1*s2+s3+1:s1*s2+s3+s4);
LbUpdate = Update_Tot(s1*s2+s3+s4+1:s1*s2+s3+s4+s5);
end
function [IWJ,IbJ,LWJ,LbJ] = Jacobian_LM(IW,LW,Ib,Lb,x,y) %#ok<INUSL>
%UNTITLED Summary of this function goes here
% Detailed explanation goes here
%FORWARD PASS
zI = IW*x + Ib;
aI = tansig(zI);
%zL = LW*aI + Lb;
%aL = zL;
%%%%%%%%%%%%%%
%BACKPROPAGATION
deltaLW = ones(1,size(y,2));
deltaIW = (1 - aI.^2).*LW'.*deltaLW;
%%%%%%%%%%%%%%%%
%JACOBIAN COMPUTATION
IWJ = nan(size(deltaIW,1),size(x,1),size(x,2));
for i = 1:size(x,2)
IWJ(:,:,i) = deltaIW(:,i).*x(:,i)';
end
IbJ = deltaIW;
LWJ = aI.*deltaLW;
LbJ = deltaLW;
%%%%%%%%%%%%%%%%%%%%%
end
And here's a script that you can just copy/paste to test the code (if you have the file above in your working directory of course):
rng(0)
clear
format long
warning('off')
%DEFINE A SIMPLE PROBLEM
x = rand(2,1000)*10;
x = x-5;
y = x(1,:).^2 + x(2,:).^2;
x = x/10;
%%%%%%%%%%%%%%%%%%%%%%%%
%DEFINE TRAINING PARAMETERS
disp(" ")
disp(" ")
Initial_Mu = 0.001;
Incr_Rate = 10;
Decr_Rate = 0.1;
Max_Mu = 1e10;
Epochs = 1000;
Hidden_Neurons = 20;
disp(['Initial_Mu: ',num2str(Initial_Mu)]);
disp(['Incr_Rate: ',num2str(Incr_Rate)]);
disp(['Decr_Rate: ',num2str(Decr_Rate)]);
disp(['Hidden_Neurons: ',num2str(Hidden_Neurons)]);
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%DEFINE AND INITIALIZE A NETWORK
net = feedforwardnet(Hidden_Neurons);
net.trainParam.epochs = Epochs;
net.trainParam.mu_dec = Decr_Rate;
net.trainParam.mu_inc = Incr_Rate;
net.trainParam.mu = Initial_Mu;
net.trainParam.mu_max = Max_Mu;
%net.trainParam.showWindow = false;
net.inputs{1,1}.processFcns = {};
net.outputs{1,2}.processFcns = {};
net.trainParam.min_grad = 1e-25;
net.trainParam.max_fail = 50;
net.divideParam.trainRatio = 1;
net.divideParam.valRatio = 0;
net.divideParam.testRatio = 0;
net = configure(net,x,y);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%USE THAT INITIALIZED NETWORK FOR TRAINING WITH MATLAB'S TRAINLM
disp(" ")
disp(" ")
netMATLAB = net;
tic
netMATLAB = train(netMATLAB,x,y);
disp("Matlab time:")
toc
disp(" ")
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%USE THE SAME INITIALIZED NETWORK FOR CUSTOM TRAINING
netCUSTOM = net;
tic
[netCUSTOM] = Levenberg_Marquardt_Backpropagation(netCUSTOM,x,y,Initial_Mu,Incr_Rate,Decr_Rate,Max_Mu,Epochs);
disp("Custom time:")
toc
disp(" ")
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%COMPARE RESULTS
Pred_Matlab = netMATLAB(x);
Pred_Custom = netCUSTOM(x);
disp("Absolute difference between Matlab and Custom outputs");
disp(mean(abs(Pred_Matlab - Pred_Custom)));
disp("Matlab MAE")
disp(mean(abs(Pred_Matlab - y)))
disp("Custom MAE")
disp(mean(abs(Pred_Custom - y)))
%%%%%%%%%%%%%%%%
This script compares both execution time and convergence.
When you run the script above you get results like following.
Here's some examples with different settings:
With 3 neurons the 2 algorithms (trainlm and my custom LM) converges to similar (yet not identical) networks:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 3
Matlab time:
Elapsed time is 0.793722 seconds.
Custom time:
Elapsed time is 1.711315 seconds.
Absolute difference between Matlab and Custom outputs
7.440071774738044e-07
Matlab MAE
6.179026991132736
Custom MAE
6.179027091263226
With 5 neurons the results are very different. For some reason my algorithm converges to a better network:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 5
Matlab time:
Elapsed time is 0.883287 seconds.
Custom time:
Elapsed time is 1.980380 seconds.
Absolute difference between Matlab and Custom outputs
0.007647618697004
Matlab MAE
0.013716265278847
Custom MAE
0.006069555043366
With 10 neurons the networks are pretty much the same:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 10
Matlab time:
Elapsed time is 1.011172 seconds.
Custom time:
Elapsed time is 2.766820 seconds.
Absolute difference between Matlab and Custom outputs
1.992397943695323e-08
Matlab MAE
0.030143779923111
Custom MAE
0.030143774946144
With 50 neurons the results are still very similar:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 50
Matlab time:
Elapsed time is 2.985799 seconds.
Custom time:
Elapsed time is 7.052290 seconds.
Absolute difference between Matlab and Custom outputs
8.268088436125254e-13
Matlab MAE
2.147009752994717e-04
Custom MAE
2.147009753220598e-04
With 100 neurons, again, the results are close:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 100
Matlab time:
Elapsed time is 8.071735 seconds.
Custom time:
Elapsed time is 17.729801 seconds.
Absolute difference between Matlab and Custom outputs
3.318731955914700e-13
Matlab MAE
6.768720617779367e-05
Custom MAE
6.768720614557056e-05
Finally with 200 neurons there's a significant difference:
Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 200
Matlab time:
Elapsed time is 23.504194 seconds.
Custom time:
Elapsed time is 49.679209 seconds.
Absolute difference between Matlab and Custom outputs
1.711683279275178e-07
Matlab MAE
2.712217358494099e-04
Custom MAE
2.712337472282703e-04
BUT, if you force mu to be constant by settings the increase and decrease rate to 1, the results becomes close even with 200 neurons:
Initial_Mu: 0.001
Incr_Rate: 1
Decr_Rate: 1
Hidden_Neurons: 200
Matlab time:
Elapsed time is 16.398573 seconds.
Custom time:
Elapsed time is 24.675034 seconds.
Absolute difference between Matlab and Custom outputs
7.406640634144424e-12
Matlab MAE
0.020155109646983
Custom MAE
0.020155109647200
Observations and questions:
  1. My algorithm is way slower than trainlm. It's about 2 times slower. There's some clear bottleneck that I'm missing in my code?
  2. It seems there's some implementation difference between trainlm and my algo. They're definitely not the same. Similar but not the same.
  3. My first guess is that matlab uses some clever way to invert the hessian matrix. I just used the "inv" function and disabled all the warnings. How matlab invert the hessian matrix exactly? Does it use the Penrose pseudoinverse with some smart tolerance settings?
  4. Considering that with a costant mu the two algos produce very similar outputs my second guess is that the difference is in the mu update pattern, probably I use a slightly different way to update the mu parameter. At every iteration I just keep updating mu with a while loop until a performance improvement is achieved. How exactly matlab updates the parameter mu through the iterations?
  5. Does matlab use some implicit minimal mu to limit the mu updates like the max_mu parameter?
I apologize for the length of this post and thank you in advance

 Accepted Answer

Matt J
Matt J on 23 Dec 2019
Edited: Matt J on 23 Dec 2019
My first guess is that matlab uses some clever way to invert the hessian matrix. I just used the "inv" function and disabled all the warnings. How matlab invert the hessian matrix exactly?
It probably doesn't invert the Hessian at all. A complete matrix inversion is a very inefficient way to solve a set of linear equations. More likely, it uses mldivide() or linsolve(). Compare:
>> N=6000; A=rand(N); b=rand(N,1);
>> tic; A\b; toc
Elapsed time is 2.510903 seconds.
>> tic; inv(A)*b; toc
Elapsed time is 10.053053 seconds.
EDIT:
At every iteration I just keep reducing mu with a while loop until a performance improvement is achieved. How exactly matlab updates the parameter mu through the iterations?
I don't think you want to be reducing mu in pursuit of better descent. That will make the inversion more and more singular. You want to be increasing mu instead. The mechanism by which trainlm does this is described in the trainlm documentation, in particular
"The adaptive value mu is increased by mu_inc until the change above results in a reduced performance value. The change is then made to the network and mu is decreased by mu_dec."

13 Comments

Hi Matt,
thank you so much for your answer. In this function:
function [IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu)
%UNTITLED Summary of this function goes here
% Detailed explanation goes here
s1 = size(IWJ,1);
s2 = size(IWJ,2);
s3 = size(IbJ,1);
s4 = size(LWJ,1);
s5 = size(LbJ,1);
s6 = size(IWJ,3);
Jac = nan(s1*s2 + s3 + s4 + s5,s6);
Jac(1:s1*s2,:) = reshape(IWJ,s1*s2,s6);
Jac(s1*s2+1:s1*s2+s3,:) = IbJ;
Jac(s1*s2+s3+1:s1*s2+s3+s4,:) = LWJ;
Jac(s1*s2+s3+s4+1:s1*s2+s3+s4+s5,:) = LbJ;
H = (Jac*Jac')/s6;
D = mean(Jac.*(Pred - y),2);
Update_Tot = -pinv(H + mu*eye(size(H,1)), min(H(:))/1000)*D;
IWUpdate = reshape(Update_Tot(1:s1*s2),s1,s2);
IbUpdate = Update_Tot(s1*s2+1:s1*s2+s3);
LWUpdate = Update_Tot(s1*s2+s3+1:s1*s2+s3+s4);
LbUpdate = Update_Tot(s1*s2+s3+s4+1:s1*s2+s3+s4+s5);
end
I tried to replace this:
Update_Tot = -pinv(H + mu*eye(size(H,1)), min(H(:))/1000)*D;
with this:
a = H + mu*eye(size(H,1));
Update_Tot = -linsolve(a,D);
and with this:
a = H + mu*eye(size(H,1));
Update_Tot = -a\D;
and with this:
a = H + mu*eye(size(H,1));
Update_Tot = -mldivide(a,D);
and also with:
a = H + mu*eye(size(H,1));
Update_Tot = -inv(a)*D;
Every of these improved the execution time but the final result is still far away from the execution time of trainlm.
In particular these 3 seem to be the faster ways to compute the update vectors:
Update_Tot = -linsolve(a,D);
Update_Tot = -a\D;
Update_Tot = -mldivide(a,D);
but they need roughly the same execution time.
For instance, when matlab's trainlm requires about 5 seconds these require about 15 seconds and -inv(a)*D; requires about 25 seconds.
What am I doing wrong?
Well, there are still potential issues in the way that you are updating mu, see my edited answer above.
I just read it but I'm not sure to understand.
When you say
I don't think you want to be reducing mu in pursuit of better descent. That will make the inversion more and more singular. You want to be increasing mu instead.
I'm confused. If mu gets big LM approaches steepest descent and it gets glacially slow. If mu is small LM converges to quasi-newton that is faster.
Here's the loop we are talking about:
for i = 1:iterations
while mu <= max_mu && mu > 1e-20
%PREV-PERFORMANCE COMPUTATION
Pred = LW*tansig(IW*x + Ib) + Lb;
Prev_Perf = mean((y-Pred).^2);
%%%%%%%%%%%%%%%%%%%%%%%%
%PREVIOUS WEIGHTS\BIASES STORAGE
Prev_IW = IW;
Prev_Ib = Ib;
Prev_LW = LW;
Prev_Lb = Lb;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%GRADIENT\HESSIAN COMPUTATION
[IWJ,IbJ,LWJ,LbJ] = Jacobian_LM(IW,LW,Ib,Lb,x,y);
[IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%WEIGHTS\BIASES UPDATE
IW = IW + IWUpdate;
Ib = Ib + IbUpdate;
LW = LW + LWUpdate';
Lb = Lb + LbUpdate;
%%%%%%%%%%%
%PERFORMANCE COMPUTATION
Pred = LW*tansig(IW*x + Ib) + Lb;
Perf = mean((y-Pred).^2);
%%%%%%%%%%%%%%%%%%%%%%%%
%PERFORMANCE CHECK
if(Perf >= Prev_Perf)
IW = Prev_IW;
Ib = Prev_Ib;
LW = Prev_LW;
Lb = Prev_Lb;
mu = mu*mu_increase_rate;
else
mu = mu*mu_decrease_rate;
break;
end
%%%%%%%%%%%%%%%%%%
end
end
I'm 99% sure that the loop structure is the same of trainlm. At every iteration a while loop starts that tries to improve the performance updating mu and weights. If new weights result in better perforamnce the while loop is over, otherwise mu is increased. In any case the loop is over if mu exceeds max_mu or 1e-20.
In the trainlm function this part is (of course there are other stop conditions here but I disabled all of them in my tests):
function [worker,calcNet] = trainingIteration(worker,calcLib,calcNet)
% Cross worker control variables
muBreak = [];
perfBreak = [];
% Levenberg Marquardt
while true
if calcLib.isMainWorker
muBreak = (worker.mu > worker.param.mu_max);
end
if calcLib.broadcast(muBreak)
break
end
if calcLib.isMainWorker
% Check for Singular Matrix
[msgstr,msgid] = lastwarn;
lastwarn('MATLAB:nothing','MATLAB:nothing')
warnstate = warning('off','all');
dWB = -(worker.jj + worker.ii * worker.mu) \ worker.je;
[~,msgid1] = lastwarn;
flag_inv = isequal(msgid1,'MATLAB:nothing');
if flag_inv
lastwarn(msgstr,msgid);
end;
warning(warnstate)
worker.WB2 = worker.WB + dWB;
end
calcNet2 = calcLib.setwb(calcNet,worker.WB2);
perf2 = calcLib.trainPerf(calcNet2);
if calcLib.isMainWorker
perfBreak = (perf2 < worker.perf) && flag_inv;
end
if calcLib.broadcast(perfBreak)
worker.WB = worker.WB2;
calcNet = calcNet2;
if calcLib.isMainWorker
worker.mu = max(worker.mu * worker.param.mu_dec,1e-20);
end
break
end
if calcLib.isMainWorker
worker.mu = worker.mu * worker.param.mu_inc;
end
end
Matt J
Matt J on 23 Dec 2019
Edited: Matt J on 23 Dec 2019
If new weights result in better perforamnce the while loop is over, otherwise mu is increased.
That is the correct thing to do, but it is opposite to what you said you do in your original post. In your original post you said mu decreases until performance improves:
"I just keep reducing mu with a while loop until a performance improvement is achieved"
That is totally wrong. My fault.
(just edited the first message)
I just run a check on a single epoch in order to evaluate the execution time of the single parts of the code.
Results:
Previous performance computation
Elapsed time is 0.002138 seconds.
Previous w/b storage
Elapsed time is 0.000032 seconds.
Jacobian computation
Elapsed time is 0.009249 seconds.
Hessian and gradient computation
Elapsed time is 0.012288 seconds.
W\b update
Elapsed time is 0.000297 seconds.
New perf computation
Elapsed time is 0.001972 seconds.
Perf check and mu update
Elapsed time is 0.000053 seconds.
It seems clear that there are 2 major bottlenecks:
  • Jacobian computation
  • Hessian/Gradient computation
The jacobian matrix is computed by the Jacobian_LM function in which the only slow part is this for loop:
IWJ = nan(size(deltaIW,1),size(x,1),size(x,2));
for i = 1:size(x,2)
IWJ(:,:,i) = deltaIW(:,i).*x(:,i)';
end
Can you see any other way to compute the per-example jacobian of the input weights matrix?
deltaIW has size (num hidden neurons, num examples)
x has size (input dimension, num examples)
Any way to avoid that loop over the examples?
On the other hand the hessian and gradient ar computed by the function UpdatesThroughHessianAndGradient.
In this function there are 2 slow parts:
the building of the jacobian matrix starting with the single weights/bias jacobians in minor part:
Jac = zeros(s1*s2 + s3 + s4 + s5,s6);
Jac(1:s1*s2,:) = reshape(IWJ,s1*s2,s6);
Jac(s1*s2+1:s1*s2+s3,:) = IbJ;
Jac(s1*s2+s3+1:s1*s2+s3+s4,:) = LWJ;
Jac(s1*s2+s3+s4+1:s1*s2+s3+s4+s5,:) = LbJ;
and the inversion of: as the slowest part:
H = (Jac*Jac')/s6;
D = mean(Jac.*(Pred - y),2);
Update_Tot = -(H + mu*eye(size(H,1)))\D;
I dont see any margin of improvement here. The only way should be avoiding the computation of per-example derivatives but I have no idea how to do that...
Can you see any other way to compute the per-example jacobian of the input weights matrix?
Yes, the way to compute it is not to compute it. The thing to recognize here is that
J=reshape(IWJ,s1*s2,s6)
is the same as
J=kron(x,deltaIW)
and there are special computational identities for Kronecker products that can be used here to compute their contributions to H and D, in lieu of actually expanding the Kronecker products.
I don't quite understand why you have
H=(Jac*Jac.')/s6
D = mean(Jac.*(Pred - y),2);
when I think instead it is supposed to be
H=(Jac.'*Jac)/s6
D=Jac.'*(Pred - y)
If I assume the latter is what you meant, then the contribution of IJW to these expressions is
H_ijw = J.'*J/s6= kron(x'*x,deltaIW'*deltaW)/s6
E=Pred(1:s1*s2)-y(1:s1*s2);
D_ijw=deltaIW.'*(reshape(E,s1,s2))*x;
Thank you Matt, it's been enlightening. I need some time to try to recode stuff with the kronecker tensor operator. I'll probably annoy you again tomorrow.
PS:
I don't quite understand why you have
D = mean(Jac.*(Pred - y),2);
Here you're totally right. Again. It should be:
D = Jac*(Pred - y)'/s6;
Ok I tried your solution but I'm not sure to understand.
You say that this
J=reshape(IWJ,s1*s2,s6)
is the same than this:
J=kron(x,deltaIW)
Let's do an example:
4 hidden neurons
2 dimension input
5 examples in the dataset
then we have something like:
deltaIW = randi(10,4,5)
deltaIW =
3 3 9 1 3
8 8 4 5 4
4 7 9 7 4
7 2 4 9 3
x = randi(10,2,5)
x =
10 6 2 4 3
2 1 6 9 8
IWJ = nan(size(deltaIW,1),size(x,1),size(x,2));
for i = 1:size(x,2)
IWJ(:,:,i) = deltaIW(:,i).*x(:,i)';
end
s1 = size(IWJ,1);
s2 = size(IWJ,2);
s6 = size(IWJ,3);
J = reshape(IWJ,s1*s2,s6)
J =
30 18 18 4 9
80 48 8 20 12
40 42 18 28 12
70 12 8 36 9
6 3 54 9 24
16 8 24 45 32
8 7 54 63 32
14 2 24 81 24
but this J is definitely different from:
J2=kron(x,deltaIW)
J2 =
30 30 90 10 30 18 18 54 6 18 6 6 18 2 6 12 12 36 4 12 9 9 27 3 9
80 80 40 50 40 48 48 24 30 24 16 16 8 10 8 32 32 16 20 16 24 24 12 15 12
40 70 90 70 40 24 42 54 42 24 8 14 18 14 8 16 28 36 28 16 12 21 27 21 12
70 20 40 90 30 42 12 24 54 18 14 4 8 18 6 28 8 16 36 12 21 6 12 27 9
6 6 18 2 6 3 3 9 1 3 18 18 54 6 18 27 27 81 9 27 24 24 72 8 24
16 16 8 10 8 8 8 4 5 4 48 48 24 30 24 72 72 36 45 36 64 64 32 40 32
8 14 18 14 8 4 7 9 7 4 24 42 54 42 24 36 63 81 63 36 32 56 72 56 32
14 4 8 18 6 7 2 4 9 3 42 12 24 54 18 63 18 36 81 27 56 16 32 72 24
J is definitely contained in J2 and it could be extracted from that but there's a lot of unnecessary computation to produce J2... is more efficient to calculate the full J2 and then extract J from that?
Is there any way to perform some kind of "element wise" kron? We only need a few columns from J2...
You're right, I did the math too quickly. However, I don't see how
J =
30 18 18 4 9
80 48 8 20 12
40 42 18 28 12
70 12 8 36 9
6 3 54 9 24
16 8 24 45 32
8 7 54 63 32
14 2 24 81 24
is correct either. The number of columns of a Jacobian is supposed to be equal to the number of unknown parameters, but in the J above, it is equal to the number of examples. How many unknowns are we supposed to be estimating in this toy problem?
I never explained the detalis of the math I implemented so the confusion is inevitable, I'm sorry.
In the toy example we have 4 neurons and input x has dimension 2, so we have 4x2 = 8 weights in the matrix of the input weights. That's what happen in the first order algorithms where only the gradient is needed.
As you correctly mentioned this would result in a gradient D1
with shape (8,1) because we were computing the gradient of
but unfortunately it's not our case.
We need to compute the approximated Hessian H where:
And to do this we need J where J is:
the matrix with 5 columns (in our example) in which the i-th column is the gradient of Net calculated on x_i wrt weights and biases.
There's no way to avoid this since we need H.
D1 is equivaalent to D2 but we need the jacobian J in any case to compute H.
In other words we are not calculating the gradient of E that would be a vector. We are calculating the gradient of
for every x_i in the dataset and put this gradient as a column of J.
Probably "Jacobian" is not accurate because Net is a scalar function.
In any case the math is well explained here:
OK. Well, the loop-free method to compute J would be (using the same toy example),
deltaIW =[
3 3 9 1 3
8 8 4 5 4
4 7 9 7 4
7 2 4 9 3];
x =[
10 6 2 4 3
2 1 6 9 8];
[Nneurons,~]=size(deltaIW);
[Nfeatures,Nexamples]=size(x);
J=reshape(deltaIW,Nneurons,1,[]).*reshape(x,1,Nfeatures,[]);
J=reshape(J,[],Nexamples),
J =
30 18 18 4 9
80 48 8 20 12
40 42 18 28 12
70 12 8 36 9
6 3 54 9 24
16 8 24 45 32
8 7 54 63 32
14 2 24 81 24
Brilliant! Your idea works greatly and now we're very close to the performance of trainlm.
I'll perform more tests tomorrow and will keep you updated.
For now, thank you and merry christmas!

Sign in to comment.

More Answers (2)

Matt J
Matt J on 23 Dec 2019
Edited: Matt J on 23 Dec 2019
I wonder if you have also considered using the implementation of Levenberg-Marquardt already available through lsqnonlin. By using the SpecifyObjectiveGradient option, you could still profit from your customized Jacobian calculation, but also avoid the effort of reimplementing/debugging the framework of the algorithm responsible for the iteration updates.

1 Comment

No, i didn't. It's a very good idea, thank you.
This will be my next try if I don't manage to improve my code.
At this point it's clear to me that my algo and trainlm do the same things. The outputs of the 2 algos are the same 'til 10 significant digits or more. I jus need to understand why mine is not fully efficient. I'm close!!

Sign in to comment.

Hi@Matt J
I have a question please regarding Levenberg-Marquardt derivation. I could not understand how in the below picture, the elements of the gradient step in which G (fi(m)^2) became : 2fi(m)*G(fi(m)j. Can you please help me with this? Thanks

Asked:

on 21 Dec 2019

Answered:

on 1 Dec 2020

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!