Levenberg Marquardt Implementation Issues

Question

2 votes

EDIT: Can you at least suggest me where to find the specific matlab function that implements the LM algorithm?

Hi all,

for personal research reasons I'm trying to implement from scratch the training algorithms featured in deep learning toolbox. Everything went well with first order algorithms such as standard backpropagation gradient descent. My custom code converges to the exact same weights/biases than the matlab's one and it's also twice faster. I suppose that the higher speed is due to the greater number of if/else instructions matlab needs to do compared with my code that is suited only for one layered networks with tanh activations.

But I'm having problems with the Levenberg Marquardt. I wrote 3 functions:

Jacobian_LM
UpdatesThroughHessianAndGradient
Levenberg_Marquardt_Backpropagation

The first one just computes the "per-example" Jacobian of the network WRT weights and biases through the standard backpropagation algorithm. It means that it outputs matrices in which every column is a gradient relative to a single input/output couple.

The second one flattens the Jacobians to a single long vector and computes the hessian matrix H and the gradient D that are needed for the LM update rule:

Then it reshapes and splits again the updates into the specific weights and biases matrices of the network with the correct size and shape.

Finally the third one is an implementation of the LM loop that at every iteration updates weights and biases of a standard matlab network object (with one hidden layer and tanh activation) until a performance improvement is achieved through an increase/decrase pattern of mu.

Here is the functions you need to try my algorithm:

function [Net] = Levenberg_Marquardt_Backpropagation(Net,x,y,mu,mu_increase_rate,mu_decrease_rate,max_mu,iterations)
%UNTITLED Summary of this function goes here
%   Detailed explanation goes here
%NETWORK WEIGHTS\BIASES STORAGE
IW = Net.IW{1,1};
Ib = Net.b{1,1};
LW = Net.LW{2,1};
Lb = Net.b{2,1};
%%%%%%%%%%%%%%%%%
for i = 1:iterations
    
    while mu <= max_mu && mu > 1e-20
        %PREV-PERFORMANCE COMPUTATION
        Pred = LW*tansig(IW*x + Ib) + Lb;
        Prev_Perf = mean((y-Pred).^2);
        %%%%%%%%%%%%%%%%%%%%%%%%
        
        %PREVIOUS WEIGHTS\BIASES STORAGE
        Prev_IW = IW;
        Prev_Ib = Ib;
        Prev_LW = LW;
        Prev_Lb = Lb;
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        
        %GRADIENT\HESSIAN COMPUTATION
        [IWJ,IbJ,LWJ,LbJ] = Jacobian_LM(IW,LW,Ib,Lb,x,y);
        [IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu);
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        
        
        %WEIGHTS\BIASES UPDATE
        IW = IW + IWUpdate;
        Ib = Ib + IbUpdate;
        LW = LW + LWUpdate';
        Lb = Lb + LbUpdate;
        %%%%%%%%%%%
        
        
        %PERFORMANCE COMPUTATION
        Pred = LW*tansig(IW*x + Ib) + Lb;
        Perf = mean((y-Pred).^2);
        %%%%%%%%%%%%%%%%%%%%%%%%
        
        
        %PERFORMANCE CHECK
        if(Perf >= Prev_Perf)
            IW = Prev_IW;
            Ib = Prev_Ib;
            LW = Prev_LW;
            Lb = Prev_Lb;
            mu = mu*mu_increase_rate;
        else
            mu = mu*mu_decrease_rate;
            break;
        end
        %%%%%%%%%%%%%%%%%%
    end
    
end
%FINAL NET UPDATE
Net.IW{1,1} = IW;
Net.LW{2,1} = LW;
Net.b{1,1} = Ib;
Net.b{2,1} = Lb;
%%%%%%%%%%%
end
function [IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu)
%UNTITLED Summary of this function goes here
%   Detailed explanation goes here
s1 = size(IWJ,1);
s2 = size(IWJ,2);
s3 = size(IbJ,1);
s4 = size(LWJ,1);
s5 = size(LbJ,1);
s6 = size(IWJ,3);
Jac = nan(s1*s2 + s3 + s4 + s5,s6);
Jac(1:s1*s2,:) = reshape(IWJ,s1*s2,s6);
Jac(s1*s2+1:s1*s2+s3,:) = IbJ;
Jac(s1*s2+s3+1:s1*s2+s3+s4,:) = LWJ;
Jac(s1*s2+s3+s4+1:s1*s2+s3+s4+s5,:) = LbJ;
H = (Jac*Jac')/s6;
D = mean(Jac.*(Pred - y),2);
Update_Tot = -pinv(H + mu*eye(size(H,1)), min(H(:))/1000)*D;
IWUpdate = reshape(Update_Tot(1:s1*s2),s1,s2);
IbUpdate = Update_Tot(s1*s2+1:s1*s2+s3);
LWUpdate = Update_Tot(s1*s2+s3+1:s1*s2+s3+s4);
LbUpdate = Update_Tot(s1*s2+s3+s4+1:s1*s2+s3+s4+s5);
end
function [IWJ,IbJ,LWJ,LbJ] = Jacobian_LM(IW,LW,Ib,Lb,x,y) %#ok<INUSL>
%UNTITLED Summary of this function goes here
%   Detailed explanation goes here
%FORWARD PASS
zI = IW*x + Ib;
aI = tansig(zI);
%zL = LW*aI + Lb;
%aL = zL;
%%%%%%%%%%%%%%
%BACKPROPAGATION
deltaLW = ones(1,size(y,2));
deltaIW = (1 - aI.^2).*LW'.*deltaLW;
%%%%%%%%%%%%%%%%
%JACOBIAN COMPUTATION
IWJ = nan(size(deltaIW,1),size(x,1),size(x,2));
for i = 1:size(x,2)
    IWJ(:,:,i) = deltaIW(:,i).*x(:,i)';
end
IbJ = deltaIW;
LWJ = aI.*deltaLW;
LbJ = deltaLW;
%%%%%%%%%%%%%%%%%%%%%
end

And here's a script that you can just copy/paste to test the code (if you have the file above in your working directory of course):

rng(0)
clear
format long
warning('off')
%DEFINE A SIMPLE PROBLEM
x = rand(2,1000)*10;
x = x-5;
y = x(1,:).^2 + x(2,:).^2;
x = x/10;
%%%%%%%%%%%%%%%%%%%%%%%%
%DEFINE TRAINING PARAMETERS
disp(" ")
disp(" ")
Initial_Mu = 0.001;
Incr_Rate = 10;
Decr_Rate = 0.1;
Max_Mu = 1e10;
Epochs = 1000;
Hidden_Neurons = 20;
disp(['Initial_Mu: ',num2str(Initial_Mu)]);
disp(['Incr_Rate: ',num2str(Incr_Rate)]);
disp(['Decr_Rate: ',num2str(Decr_Rate)]);
disp(['Hidden_Neurons: ',num2str(Hidden_Neurons)]);
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%DEFINE AND INITIALIZE A NETWORK
net = feedforwardnet(Hidden_Neurons);
net.trainParam.epochs = Epochs;
net.trainParam.mu_dec = Decr_Rate;
net.trainParam.mu_inc = Incr_Rate;
net.trainParam.mu = Initial_Mu;
net.trainParam.mu_max = Max_Mu;
%net.trainParam.showWindow = false;
net.inputs{1,1}.processFcns = {};
net.outputs{1,2}.processFcns = {};
net.trainParam.min_grad = 1e-25;
net.trainParam.max_fail = 50;
net.divideParam.trainRatio = 1;
net.divideParam.valRatio = 0;
net.divideParam.testRatio = 0;
net = configure(net,x,y);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%USE THAT INITIALIZED NETWORK FOR TRAINING WITH MATLAB'S TRAINLM
disp(" ")
disp(" ")
netMATLAB = net;
tic
netMATLAB = train(netMATLAB,x,y);
disp("Matlab time:")
toc
disp(" ")
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%USE THE SAME INITIALIZED NETWORK FOR CUSTOM TRAINING
netCUSTOM = net;
tic
[netCUSTOM] = Levenberg_Marquardt_Backpropagation(netCUSTOM,x,y,Initial_Mu,Incr_Rate,Decr_Rate,Max_Mu,Epochs);
disp("Custom time:")
toc
disp(" ")
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%COMPARE RESULTS
Pred_Matlab = netMATLAB(x);
Pred_Custom = netCUSTOM(x);
disp("Absolute difference between Matlab and Custom outputs");
disp(mean(abs(Pred_Matlab - Pred_Custom)));
disp("Matlab MAE")
disp(mean(abs(Pred_Matlab - y)))
disp("Custom MAE")
disp(mean(abs(Pred_Custom - y)))
%%%%%%%%%%%%%%%%

This script compares both execution time and convergence.

When you run the script above you get results like following.

Here's some examples with different settings:

With 3 neurons the 2 algorithms (trainlm and my custom LM) converges to similar (yet not identical) networks:

Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 3
 
 
Matlab time:
Elapsed time is 0.793722 seconds.
 
Custom time:
Elapsed time is 1.711315 seconds.
 
Absolute difference between Matlab and Custom outputs
     7.440071774738044e-07
Matlab MAE
   6.179026991132736
Custom MAE
   6.179027091263226

With 5 neurons the results are very different. For some reason my algorithm converges to a better network:

Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 5
 
 
Matlab time:
Elapsed time is 0.883287 seconds.
 
Custom time:
Elapsed time is 1.980380 seconds.
 
Absolute difference between Matlab and Custom outputs
   0.007647618697004
Matlab MAE
   0.013716265278847
Custom MAE
   0.006069555043366

With 10 neurons the networks are pretty much the same:

Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 10
 
 
Matlab time:
Elapsed time is 1.011172 seconds.
 
Custom time:
Elapsed time is 2.766820 seconds.
 
Absolute difference between Matlab and Custom outputs
     1.992397943695323e-08
Matlab MAE
   0.030143779923111
Custom MAE
   0.030143774946144

With 50 neurons the results are still very similar:

Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 50
 
 
Matlab time:
Elapsed time is 2.985799 seconds.
 
Custom time:
Elapsed time is 7.052290 seconds.
 
Absolute difference between Matlab and Custom outputs
     8.268088436125254e-13
Matlab MAE
     2.147009752994717e-04
Custom MAE
     2.147009753220598e-04

With 100 neurons, again, the results are close:

Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 100
 
 
Matlab time:
Elapsed time is 8.071735 seconds.
 
Custom time:
Elapsed time is 17.729801 seconds.
 
Absolute difference between Matlab and Custom outputs
     3.318731955914700e-13
Matlab MAE
     6.768720617779367e-05
Custom MAE
     6.768720614557056e-05

Finally with 200 neurons there's a significant difference:

Initial_Mu: 0.001
Incr_Rate: 10
Decr_Rate: 0.1
Hidden_Neurons: 200
 
 
Matlab time:
Elapsed time is 23.504194 seconds.
 
Custom time:
Elapsed time is 49.679209 seconds.
 
Absolute difference between Matlab and Custom outputs
     1.711683279275178e-07
Matlab MAE
     2.712217358494099e-04
Custom MAE
     2.712337472282703e-04

BUT, if you force mu to be constant by settings the increase and decrease rate to 1, the results becomes close even with 200 neurons:

Initial_Mu: 0.001
Incr_Rate: 1
Decr_Rate: 1
Hidden_Neurons: 200
 
 
Matlab time:
Elapsed time is 16.398573 seconds.
 
Custom time:
Elapsed time is 24.675034 seconds.
 
Absolute difference between Matlab and Custom outputs
     7.406640634144424e-12
Matlab MAE
   0.020155109646983
Custom MAE
   0.020155109647200

Observations and questions:

My algorithm is way slower than trainlm. It's about 2 times slower. There's some clear bottleneck that I'm missing in my code?
It seems there's some implementation difference between trainlm and my algo. They're definitely not the same. Similar but not the same.
My first guess is that matlab uses some clever way to invert the hessian matrix. I just used the "inv" function and disabled all the warnings. How matlab invert the hessian matrix exactly? Does it use the Penrose pseudoinverse with some smart tolerance settings?
Considering that with a costant mu the two algos produce very similar outputs my second guess is that the difference is in the mu update pattern, probably I use a slightly different way to update the mu parameter. At every iteration I just keep updating mu with a while loop until a performance improvement is achieved. How exactly matlab updates the parameter mu through the iterations?
Does matlab use some implicit minimal mu to limit the mu updates like the max_mu parameter?

I apologize for the length of this post and thank you in advance

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Matt J on 23 Dec 2019

Edited: Matt J on 23 Dec 2019

Open in MATLAB Online

1 vote

My first guess is that matlab uses some clever way to invert the hessian matrix. I just used the "inv" function and disabled all the warnings. How matlab invert the hessian matrix exactly?

It probably doesn't invert the Hessian at all. A complete matrix inversion is a very inefficient way to solve a set of linear equations. More likely, it uses mldivide() or linsolve(). Compare:

>> N=6000; A=rand(N); b=rand(N,1);
>> tic; A\b; toc
Elapsed time is 2.510903 seconds.
>> tic; inv(A)*b; toc
Elapsed time is 10.053053 seconds.

EDIT:

At every iteration I just keep reducing mu with a while loop until a performance improvement is achieved. How exactly matlab updates the parameter mu through the iterations?

I don't think you want to be reducing mu in pursuit of better descent. That will make the inversion more and more singular. You want to be increasing mu instead. The mechanism by which trainlm does this is described in the trainlm documentation, in particular

"The adaptive value mu is increased by mu_inc until the change above results in a reduced performance value. The change is then made to the network and mu is decreased by mu_dec."

13 Comments
Show 11 older comments Hide 11 older comments

Roberto on 23 Dec 2019

Edited: Roberto on 23 Dec 2019

Open in MATLAB Online

Hi Matt,

thank you so much for your answer. In this function:

function [IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu)
%UNTITLED Summary of this function goes here
%   Detailed explanation goes here
s1 = size(IWJ,1);
s2 = size(IWJ,2);
s3 = size(IbJ,1);
s4 = size(LWJ,1);
s5 = size(LbJ,1);
s6 = size(IWJ,3);
Jac = nan(s1*s2 + s3 + s4 + s5,s6);
Jac(1:s1*s2,:) = reshape(IWJ,s1*s2,s6);
Jac(s1*s2+1:s1*s2+s3,:) = IbJ;
Jac(s1*s2+s3+1:s1*s2+s3+s4,:) = LWJ;
Jac(s1*s2+s3+s4+1:s1*s2+s3+s4+s5,:) = LbJ;
H = (Jac*Jac')/s6;
D = mean(Jac.*(Pred - y),2);
Update_Tot = -pinv(H + mu*eye(size(H,1)), min(H(:))/1000)*D;
IWUpdate = reshape(Update_Tot(1:s1*s2),s1,s2);
IbUpdate = Update_Tot(s1*s2+1:s1*s2+s3);
LWUpdate = Update_Tot(s1*s2+s3+1:s1*s2+s3+s4);
LbUpdate = Update_Tot(s1*s2+s3+s4+1:s1*s2+s3+s4+s5);
end

I tried to replace this:

Update_Tot = -pinv(H + mu*eye(size(H,1)), min(H(:))/1000)*D;

with this:

a = H + mu*eye(size(H,1));
Update_Tot = -linsolve(a,D);

and with this:

a = H + mu*eye(size(H,1));
Update_Tot = -a\D;

and with this:

a = H + mu*eye(size(H,1));
Update_Tot = -mldivide(a,D);

and also with:

a = H + mu*eye(size(H,1));
Update_Tot = -inv(a)*D;

Every of these improved the execution time but the final result is still far away from the execution time of trainlm.

In particular these 3 seem to be the faster ways to compute the update vectors:

Update_Tot = -linsolve(a,D);
Update_Tot = -a\D;
Update_Tot = -mldivide(a,D);

but they need roughly the same execution time.

For instance, when matlab's trainlm requires about 5 seconds these require about 15 seconds and -inv(a)*D; requires about 25 seconds.

What am I doing wrong?

Roberto on 23 Dec 2019

Open in MATLAB Online

I just read it but I'm not sure to understand.

When you say

I don't think you want to be reducing mu in pursuit of better descent. That will make the inversion more and more singular. You want to be increasing mu instead.

I'm confused. If mu gets big LM approaches steepest descent and it gets glacially slow. If mu is small LM converges to quasi-newton that is faster.

Here's the loop we are talking about:

for i = 1:iterations
    
    while mu <= max_mu && mu > 1e-20
        %PREV-PERFORMANCE COMPUTATION
        Pred = LW*tansig(IW*x + Ib) + Lb;
        Prev_Perf = mean((y-Pred).^2);
        %%%%%%%%%%%%%%%%%%%%%%%%
        
        %PREVIOUS WEIGHTS\BIASES STORAGE
        Prev_IW = IW;
        Prev_Ib = Ib;
        Prev_LW = LW;
        Prev_Lb = Lb;
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        
        %GRADIENT\HESSIAN COMPUTATION
        [IWJ,IbJ,LWJ,LbJ] = Jacobian_LM(IW,LW,Ib,Lb,x,y);
        [IWUpdate,IbUpdate,LWUpdate,LbUpdate] = UpdatesThroughHessianAndGradient(IWJ,IbJ,LWJ,LbJ,Pred,y,mu);
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        
        
        %WEIGHTS\BIASES UPDATE
        IW = IW + IWUpdate;
        Ib = Ib + IbUpdate;
        LW = LW + LWUpdate';
        Lb = Lb + LbUpdate;
        %%%%%%%%%%%
        
        
        %PERFORMANCE COMPUTATION
        Pred = LW*tansig(IW*x + Ib) + Lb;
        Perf = mean((y-Pred).^2);
        %%%%%%%%%%%%%%%%%%%%%%%%
        
        
        %PERFORMANCE CHECK
        if(Perf >= Prev_Perf)
            IW = Prev_IW;
            Ib = Prev_Ib;
            LW = Prev_LW;
            Lb = Prev_Lb;
            mu = mu*mu_increase_rate;
        else
            mu = mu*mu_decrease_rate;
            break;
        end
        %%%%%%%%%%%%%%%%%%
    end
    
end

I'm 99% sure that the loop structure is the same of trainlm. At every iteration a while loop starts that tries to improve the performance updating mu and weights. If new weights result in better perforamnce the while loop is over, otherwise mu is increased. In any case the loop is over if mu exceeds max_mu or 1e-20.

In the trainlm function this part is (of course there are other stop conditions here but I disabled all of them in my tests):

function [worker,calcNet] = trainingIteration(worker,calcLib,calcNet)
% Cross worker control variables
muBreak = [];
perfBreak = [];
% Levenberg Marquardt
while true
    if calcLib.isMainWorker
        muBreak = (worker.mu > worker.param.mu_max);
    end
    if calcLib.broadcast(muBreak)
        break
    end
    
    if calcLib.isMainWorker
        % Check for Singular Matrix
        [msgstr,msgid] = lastwarn;
        lastwarn('MATLAB:nothing','MATLAB:nothing')
        warnstate = warning('off','all');
        dWB = -(worker.jj + worker.ii * worker.mu) \ worker.je;
        [~,msgid1] = lastwarn;
        flag_inv = isequal(msgid1,'MATLAB:nothing');
        if flag_inv
            lastwarn(msgstr,msgid);
        end;
        warning(warnstate)
        worker.WB2 = worker.WB + dWB;
    end
    
    calcNet2 = calcLib.setwb(calcNet,worker.WB2);
    perf2 = calcLib.trainPerf(calcNet2);
    
    if calcLib.isMainWorker
        perfBreak = (perf2 < worker.perf) && flag_inv;
    end
    if calcLib.broadcast(perfBreak)
        worker.WB = worker.WB2;
        calcNet = calcNet2;
        if calcLib.isMainWorker
            worker.mu = max(worker.mu * worker.param.mu_dec,1e-20);
        end
        break
    end
    
    if calcLib.isMainWorker
        worker.mu = worker.mu * worker.param.mu_inc;
    end
end

Roberto on 23 Dec 2019

Open in MATLAB Online

I just run a check on a single epoch in order to evaluate the execution time of the single parts of the code.

Results:

Previous performance computation
Elapsed time is 0.002138 seconds.
Previous w/b storage
Elapsed time is 0.000032 seconds.
Jacobian computation
Elapsed time is 0.009249 seconds.
Hessian and gradient computation
Elapsed time is 0.012288 seconds.
W\b update
Elapsed time is 0.000297 seconds.
New perf computation
Elapsed time is 0.001972 seconds.
Perf check and mu update
Elapsed time is 0.000053 seconds.

It seems clear that there are 2 major bottlenecks:

Jacobian computation
Hessian/Gradient computation

The jacobian matrix is computed by the Jacobian_LM function in which the only slow part is this for loop:

IWJ = nan(size(deltaIW,1),size(x,1),size(x,2));
for i = 1:size(x,2)
    IWJ(:,:,i) = deltaIW(:,i).*x(:,i)';
end

Can you see any other way to compute the per-example jacobian of the input weights matrix?

deltaIW has size (num hidden neurons, num examples)

x has size (input dimension, num examples)

Any way to avoid that loop over the examples?

On the other hand the hessian and gradient ar computed by the function UpdatesThroughHessianAndGradient.

In this function there are 2 slow parts:

the building of the jacobian matrix starting with the single weights/bias jacobians in minor part:

Jac = zeros(s1*s2 + s3 + s4 + s5,s6);
Jac(1:s1*s2,:) = reshape(IWJ,s1*s2,s6);
Jac(s1*s2+1:s1*s2+s3,:) = IbJ;
Jac(s1*s2+s3+1:s1*s2+s3+s4,:) = LWJ;
Jac(s1*s2+s3+s4+1:s1*s2+s3+s4+s5,:) = LbJ;

and the inversion of:

as the slowest part:

H = (Jac*Jac')/s6;
D = mean(Jac.*(Pred - y),2);
Update_Tot = -(H + mu*eye(size(H,1)))\D;

I dont see any margin of improvement here. The only way should be avoiding the computation of per-example derivatives but I have no idea how to do that...

Roberto on 24 Dec 2019

Edited: Roberto on 24 Dec 2019

Open in MATLAB Online

Ok I tried your solution but I'm not sure to understand.

You say that this

J=reshape(IWJ,s1*s2,s6)

is the same than this:

J=kron(x,deltaIW)

Let's do an example:

4 hidden neurons

2 dimension input

5 examples in the dataset

then we have something like:

deltaIW = randi(10,4,5)
deltaIW =
     3     3     9     1     3
     8     8     4     5     4
     4     7     9     7     4
     7     2     4     9     3
     
 x = randi(10,2,5)
 x =
    10     6     2     4     3
     2     1     6     9     8
     
 IWJ = nan(size(deltaIW,1),size(x,1),size(x,2));
 for i = 1:size(x,2)
    IWJ(:,:,i) = deltaIW(:,i).*x(:,i)';
 end
 s1 = size(IWJ,1);
 s2 = size(IWJ,2);
 s6 = size(IWJ,3);
 J = reshape(IWJ,s1*s2,s6)
 J =
     30    18    18     4     9
     80    48     8    20    12
     40    42    18    28    12
     70    12     8    36     9
      6     3    54     9    24
     16     8    24    45    32
      8     7    54    63    32
     14     2    24    81    24  

but this J is definitely different from:

J2=kron(x,deltaIW)
J2 =
  30    90    10    30    18    18    54     6    18     6     6    18     2     6    12    12    36     4    12     9     9    27     3     9
  80    40    50    40    48    48    24    30    24    16    16     8    10     8    32    32    16    20    16    24    24    12    15    12
  70    90    70    40    24    42    54    42    24     8    14    18    14     8    16    28    36    28    16    12    21    27    21    12
  20    40    90    30    42    12    24    54    18    14     4     8    18     6    28     8    16    36    12    21     6    12    27     9
   6    18     2     6     3     3     9     1     3    18    18    54     6    18    27    27    81     9    27    24    24    72     8    24
  16     8    10     8     8     8     4     5     4    48    48    24    30    24    72    72    36    45    36    64    64    32    40    32
  14    18    14     8     4     7     9     7     4    24    42    54    42    24    36    63    81    63    36    32    56    72    56    32
   4     8    18     6     7     2     4     9     3    42    12    24    54    18    63    18    36    81    27    56    16    32    72    24

J is definitely contained in J2 and it could be extracted from that but there's a lot of unnecessary computation to produce J2... is more efficient to calculate the full J2 and then extract J from that?

Is there any way to perform some kind of "element wise" kron? We only need a few columns from J2...

Roberto on 24 Dec 2019

I never explained the detalis of the math I implemented so the confusion is inevitable, I'm sorry.

In the toy example we have 4 neurons and input x has dimension 2, so we have 4x2 = 8 weights in the matrix of the input weights. That's what happen in the first order algorithms where only the gradient is needed.

As you correctly mentioned this would result in a gradient D1

with shape (8,1) because we were computing the gradient of

but unfortunately it's not our case.

We need to compute the approximated Hessian H where:

And to do this we need J where J is:

the matrix with 5 columns (in our example) in which the i-th column is the gradient of Net calculated on x_i wrt weights and biases.

There's no way to avoid this since we need H.

D1 is equivaalent to D2 but we need the jacobian J in any case to compute H.

In other words we are not calculating the gradient of E that would be a vector. We are calculating the gradient of

for every x_i in the dataset and put this gradient as a column of J.

Probably "Jacobian" is not accurate because Net is a scalar function.

In any case the math is well explained here:

Levenberg Marquardt

Matt J on 24 Dec 2019

Edited: Matt J on 24 Dec 2019

Open in MATLAB Online

OK. Well, the loop-free method to compute J would be (using the same toy example),

deltaIW =[
     3     3     9     1     3
     8     8     4     5     4
     4     7     9     7     4
     7     2     4     9     3];
 
 
  x =[
    10     6     2     4     3
     2     1     6     9     8];
 
 
 [Nneurons,~]=size(deltaIW);
 [Nfeatures,Nexamples]=size(x);
 
 
 J=reshape(deltaIW,Nneurons,1,[]).*reshape(x,1,Nfeatures,[]);
 J=reshape(J,[],Nexamples),
 

 
 J =
  18    18     4     9
  48     8    20    12
  42    18    28    12
  12     8    36     9
   3    54     9    24
   8    24    45    32
   7    54    63    32
   2    24    81    24

Roberto on 24 Dec 2019

Brilliant! Your idea works greatly and now we're very close to the performance of trainlm.

I'll perform more tests tomorrow and will keep you updated.

For now, thank you and merry christmas!

Sign in to comment.

Answer 2

Matt J on 23 Dec 2019

Edited: Matt J on 23 Dec 2019

0 votes

I wonder if you have also considered using the implementation of Levenberg-Marquardt already available through lsqnonlin. By using the SpecifyObjectiveGradient option, you could still profit from your customized Jacobian calculation, but also avoid the effort of reimplementing/debugging the framework of the algorithm responsible for the iteration updates.

1 Comment
Show -1 older comments Hide -1 older comments

Roberto on 23 Dec 2019

No, i didn't. It's a very good idea, thank you.

This will be my next try if I don't manage to improve my code.

At this point it's clear to me that my algo and trainlm do the same things. The outputs of the 2 algos are the same 'til 10 significant digits or more. I jus need to understand why mine is not fully efficient. I'm close!!

Sign in to comment.

Answer 3

AHMED FAKHRI on 1 Dec 2020

0 votes

Hi@Matt J

I have a question please regarding Levenberg-Marquardt derivation. I could not understand how in the below picture, the elements of the gradient step in which G (fi(m)^2) became : 2fi(m)*G(fi(m)j. Can you please help me with this? Thanks

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Levenberg Marquardt Implementation Issues

0 Comments
Show -2 older comments Hide -2 older comments

Accepted Answer

13 Comments
Show 11 older comments Hide 11 older comments

More Answers (2)

1 Comment
Show -1 older comments Hide -1 older comments

0 Comments
Show -2 older comments Hide -2 older comments

Categories

Products

Release

Tags

Community Treasure Hunt

Levenberg Marquardt Implementation Issues

0 Comments Show -2 older comments Hide -2 older comments

Accepted Answer

13 Comments Show 11 older comments Hide 11 older comments

More Answers (2)

1 Comment Show -1 older comments Hide -1 older comments

0 Comments Show -2 older comments Hide -2 older comments

Categories

Products

Release

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

13 Comments
Show 11 older comments Hide 11 older comments

1 Comment
Show -1 older comments Hide -1 older comments

0 Comments
Show -2 older comments Hide -2 older comments