- Initialize velocities with zeros if the variable is empty.
- Perform the lookahead step by computing lookaheadParams using the previous velocity.
- Evaluate the gradients at the lookahead point using dlfeval and your modelLoss function.
- Update the velocity using the lookahead gradients.
- Update the parameters using the updated velocity (a minimal sketch of these steps follows this list).
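Below is a minimal, untested sketch of one training iteration implementing these steps with dlfeval and dlupdate. It assumes the variables from the training loop further down the page (net, X, T, learnRate, momentum, and a velocities variable that starts out as []):
% Initialize velocities with zeros on the first iteration (hypothetical sketch).
if isempty(velocities)
    velocities = dlupdate(@(p) zeros(size(p),"like",p), net.Learnables);
end
% Lookahead step: evaluate the gradients at parameters + momentum*velocity.
netLookahead = net;
netLookahead.Learnables = dlupdate(@(p,v) p + momentum.*v, net.Learnables, velocities);
[loss,gradients] = dlfeval(@modelLoss,netLookahead,X,T);
% Update the velocity using the lookahead gradients, then the parameters.
velocities = dlupdate(@(v,g) momentum.*v - learnRate.*g, velocities, gradients);
net.Learnables = dlupdate(@(p,v) p + v, net.Learnables, velocities);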
Function dlupdate to train network with Nesterov accelerated gradient
Dear all,
I wanted to use an example from the MATLAB website to train a network with Nesterov accelerated gradient. I found functions to train a network with SGD and SGDM, but I couldn't find one for Nesterov accelerated gradient. According to the MathWorks documentation, to write my own update rule I have to use the dlupdate function. I started with the example from the MathWorks website (Update parameters using custom function - MATLAB dlupdate (mathworks.com)) and it works, but I don't know how to adapt it to Nesterov accelerated gradient. Here is my code with SGD:
[XTrain,TTrain] = digitTrain4DArrayData;
classes = categories(TTrain);
numClasses = numel(classes);
layers = [
    imageInputLayer([28 28 1],'Mean',mean(XTrain,4))
    convolution2dLayer(5,20)
    reluLayer
    convolution2dLayer(3,20,'Padding',1)
    reluLayer
    convolution2dLayer(3,20,'Padding',1)
    reluLayer
    fullyConnectedLayer(numClasses)
    softmaxLayer];
net = dlnetwork(layers);
miniBatchSize = 128;
numEpochs = 30;
numObservations = numel(TTrain);
numIterationsPerEpoch = floor(numObservations./miniBatchSize);
learnRate = 0.01;
numIterations = numEpochs * numIterationsPerEpoch;
monitor = trainingProgressMonitor(Metrics="Loss",Info="Epoch",XLabel="Iteration");
iteration = 0;
epoch = 0;
while epoch < numEpochs && ~monitor.Stop
    epoch = epoch + 1;
    % Shuffle data.
    idx = randperm(numel(TTrain));
    XTrain = XTrain(:,:,:,idx);
    TTrain = TTrain(idx);
    i = 0;
    while i < numIterationsPerEpoch && ~monitor.Stop
        i = i + 1;
        iteration = iteration + 1;
        % Read mini-batch of data and convert the labels to dummy
        % variables.
        idx = (i-1)*miniBatchSize+1:i*miniBatchSize;
        X = XTrain(:,:,:,idx);
        T = zeros(numClasses, miniBatchSize,"single");
        for c = 1:numClasses
            T(c,TTrain(idx)==classes(c)) = 1;
        end
        % Convert mini-batch of data to dlarray.
        X = dlarray(single(X),"SSCB");
        % If training on a GPU, then convert data to a gpuArray.
        if canUseGPU
            X = gpuArray(X);
        end
        % Evaluate the model loss and gradients using dlfeval and the
        % modelLoss function.
        [loss,gradients] = dlfeval(@modelLoss,net,X,T);
        % Update the network parameters using the SGD algorithm defined in
        % the sgdFunction helper function.
        updateFcn = @(net,gradients) sgdFunction(net,gradients,learnRate);
        net = dlupdate(updateFcn,net,gradients);
        % Update the training progress monitor.
        recordMetrics(monitor,iteration,Loss=loss);
        updateInfo(monitor,Epoch=epoch + " of " + numEpochs);
        monitor.Progress = 100 * iteration/numIterations;
    end
end
[XTest,TTest] = digitTest4DArrayData;
XTest = dlarray(XTest,"SSCB");
if canUseGPU
    XTest = gpuArray(XTest);
end
YTest = predict(net,XTest);
[~,idx] = max(extractdata(YTest),[],1);
YTest = classes(idx);
accuracy = mean(YTest==TTest)
function [loss,gradients] = modelLoss(net,X,T)
Y = forward(net,X);
loss = crossentropy(Y,T);
gradients = dlgradient(loss,net.Learnables);
end
function parameters = sgdFunction(parameters,gradients,learnRate)
parameters = parameters - learnRate .* gradients;
end
And it gives a nice result, with an accuracy score of 0.8192.
But when I try Nesterov accelerated gradient:
[XTrain,TTrain] = digitTrain4DArrayData;
classes = categories(TTrain);
numClasses = numel(classes);
layers = [
    imageInputLayer([28 28 1],'Mean',mean(XTrain,4))
    convolution2dLayer(5,20)
    reluLayer
    convolution2dLayer(3,20,'Padding',1)
    reluLayer
    convolution2dLayer(3,20,'Padding',1)
    reluLayer
    fullyConnectedLayer(numClasses)
    softmaxLayer];
net = dlnetwork(layers);
miniBatchSize = 128;
numEpochs = 30;
numObservations = numel(TTrain);
numIterationsPerEpoch = floor(numObservations./miniBatchSize);
learnRate = 0.001;
momentum = 0.9; % Momentum parameter for Nesterov algorithm
numIterations = numEpochs * numIterationsPerEpoch;
monitor = trainingProgressMonitor(Metrics="Loss",Info="Epoch",XLabel="Iteration");
iteration = 0;
epoch = 0;
velocities = []; % Initialize velocities for Nesterov algorithm
while epoch < numEpochs && ~monitor.Stop
    epoch = epoch + 1;
    % Shuffle data.
    idx = randperm(numel(TTrain));
    XTrain = XTrain(:,:,:,idx);
    TTrain = TTrain(idx);
    i = 0;
    while i < numIterationsPerEpoch && ~monitor.Stop
        i = i + 1;
        iteration = iteration + 1;
        % Read mini-batch of data and convert the labels to dummy
        % variables.
        idx = (i-1)*miniBatchSize+1:i*miniBatchSize;
        X = XTrain(:,:,:,idx);
        T = zeros(numClasses, miniBatchSize,"single");
        for c = 1:numClasses
            T(c,TTrain(idx)==classes(c)) = 1;
        end
        % Convert mini-batch of data to dlarray.
        X = dlarray(single(X),"SSCB");
        % If training on a GPU, then convert data to a gpuArray.
        if canUseGPU
            X = gpuArray(X);
        end
        % Evaluate the model loss and gradients using dlfeval and the
        % modelLoss function.
        [loss,gradients] = dlfeval(@modelLoss,net,X,T);
        % Update the network parameters using the Nesterov momentum
        % algorithm defined in the nesterovFunction helper function.
        updateFcn = @(net,gradients) nesterovFunction(net, gradients, learnRate, momentum, velocities);
        net = dlupdate(updateFcn, net, gradients);
        % Update the training progress monitor.
        recordMetrics(monitor,iteration,Loss=loss);
        updateInfo(monitor,Epoch=epoch + " of " + numEpochs);
        monitor.Progress = 100 * iteration/numIterations;
    end
end
[XTest,TTest] = digitTest4DArrayData;
XTest = dlarray(XTest,"SSCB");
if canUseGPU
    XTest = gpuArray(XTest);
end
YTest = predict(net,XTest);
[~,idx] = max(extractdata(YTest),[],1);
YTest = classes(idx);
accuracy = mean(YTest==TTest)
function [loss,gradients] = modelLoss(net,X,T)
Y = forward(net,X);
loss = crossentropy(Y,T);
gradients = dlgradient(loss,net.Learnables);
end
function parameters = nesterovFunction(parameters, gradients, learnRate, momentum, velocities)
% Perform Nesterov Accelerated Gradient (NAG) update.
if isempty(velocities)
    velocities = gradients;
else
    % Update velocity
    velocities = momentum * velocities + learnRate * gradients;
end
% Update parameters
parameters = parameters - velocities;
end
I got an accuracy score of only 0.1, and the loss is probably wrong.
I'm not sure whether this is Nesterov accelerated gradient or just SGD with momentum. What is more, I don't know why the loss does not converge towards zero and why it stays constant.
Best regards,
Daniel
Answers (1)
Gayathri
on 20 Sep 2024
I understand that you want to adapt the “dlupdate” example based on “SGD” to “Nesterov accelerated gradient”, and that a test accuracy of only 0.1 is obtained with the code given in the question.
I can see that the “NAG” function is not implemented correctly. Because the updated “velocities” are never returned from the helper and carried over to the next iteration, the variable stays empty, so it is always re-initialized to “gradients” inside the function; the update then reduces to subtracting the raw gradients with an effective step size of 1. It can be modified as shown below.
function [parameters, velocities] = nesterovFunction(parameters, gradients,velocities, learnRate, momentum)
velocities = momentum * velocities - learnRate * gradients;
parameters = parameters + velocities;
end
To call this “nesterovFunction”, we first need to initialize the “velocities” variable in a format suitable for the “dlupdate” function: a table with the same structure as the gradients, with the values given under the “Value” variable. Hence the following code can be used to initialize the variable.
i = 2;
idx = (i-1)*miniBatchSize+1:i*miniBatchSize;
X = XTrain(:,:,:,idx);
T = zeros(numClasses, miniBatchSize,"single");
for c = 1:numClasses
    T(c,TTrain(idx)==classes(c)) = 1;
end
% Convert mini-batch of data to dlarray.
X = dlarray(single(X),"SSCB");
% If training on a GPU, then convert data to a gpuArray.
if canUseGPU
    X = gpuArray(X);
end
[loss,gradients] = dlfeval(@modelLoss,net,X,T);
velocities = gradients;
After this initialization, the function can be called to update the network parameters and the “velocities” variable.
updateFcn = @(net,gradients,velocities) nesterovFunction(net, gradients,velocities, learnRate, momentum);
[net, velocities] = dlupdate(updateFcn, net, gradients, velocities);
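As a side note, the warm-up mini-batch above is only needed to obtain a table with the right structure. A shorter, untested alternative (and a slight deviation, since it starts the momentum at zero rather than at the first gradients) is to build the “velocities” table directly from the learnables:
% Hypothetical alternative: create velocities as a zero-filled copy of the
% learnable parameters, so no extra forward/backward pass is needed.
velocities = dlupdate(@(p) zeros(size(p),"like",p), net.Learnables);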
With these changes, I am able to obtain a test accuracy of 0.9882. I have kept the learning rate at 0.01. The training loss curve is shown below.
For more information on “NAG” please refer to the following link.
For more information on “dlupdate” refer to the following link:
Hope you find this information helpful.