
PPO training stopped learning.

Lloyd on 21 Aug 2024
Answered: Kaustab Pal on 22 Aug 2024
I am trying to train the rotary inverted pendulum environment using a PPO agent. It's working... but it reaches a limit and doesn't learn past it, and I am not sure why. Newbie to RL here, so go easy on me :). I think it's something to do with the yellow line, Q0. It could also be reaching a local optimum, but I don't think that's the problem. I think the problem is that Q0 isn't getting past 100 and the agent can't extract more useful information. Hopefully, someone with a little more experience has something to say!
mdl = "rlQubeServoModel";
open_system(mdl)
theta_limit = 5*pi/8;
dtheta_limit = 30;
volt_limit = 12;
Ts = 0.005;
rng(22)
obsInfo = rlNumericSpec([7 1]);
actInfo = rlNumericSpec([1 1],UpperLimit=1,LowerLimit=-1);
agentBlk = mdl + "/RL Agent";
simEnv = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
numObs = prod(obsInfo.Dimension);
criticLayerSizes = [400 300];
actorLayerSizes = [400 300];
% critic:
criticNetwork = [
featureInputLayer(numObs)
fullyConnectedLayer(criticLayerSizes(1), ...
Weights=sqrt(2/numObs)*...
(rand(criticLayerSizes(1),numObs)-0.5), ...
Bias=1e-3*ones(criticLayerSizes(1),1))
reluLayer
fullyConnectedLayer(criticLayerSizes(2), ...
Weights=sqrt(2/criticLayerSizes(1))*...
(rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), ...
Bias=1e-3*ones(criticLayerSizes(2),1))
reluLayer
fullyConnectedLayer(1, ...
Weights=sqrt(2/criticLayerSizes(2))* ...
(rand(1,criticLayerSizes(2))-0.5), ...
Bias=1e-3)
];
criticNetwork = dlnetwork(criticNetwork);
summary(criticNetwork)
critic = rlValueFunction(criticNetwork,obsInfo);
% actor:
% Input path layers
inPath = [
featureInputLayer( ...
prod(obsInfo.Dimension), ...
Name="netOin")
fullyConnectedLayer( ...
prod(actInfo.Dimension), ...
Name="infc")
];
% Path layers for mean value
meanPath = [
tanhLayer(Name="tanhMean");
fullyConnectedLayer(prod(actInfo.Dimension));
scalingLayer(Name="scale", ...
Scale=actInfo.UpperLimit)
];
% Path layers for standard deviations
% Using softplus layer to make them non negative
sdevPath = [
tanhLayer(Name="tanhStdv");
fullyConnectedLayer(prod(actInfo.Dimension));
softplusLayer(Name="splus")
];
net = dlnetwork();
net = addLayers(net,inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = connectLayers(net,"infc","tanhMean/in");
net = connectLayers(net,"infc","tanhStdv/in");
plot(net)
net = initialize(net);
summary(net)
actor = rlContinuousGaussianActor(net, obsInfo, actInfo, ...
ActionMeanOutputNames="scale",...
ActionStandardDeviationOutputNames="splus",...
ObservationInputNames="netOin");
actorOpts = rlOptimizerOptions(LearnRate=1e-4);
criticOpts = rlOptimizerOptions(LearnRate=1e-4);
agentOpts = rlPPOAgentOptions(...
ExperienceHorizon=600,...
ClipFactor=0.02,...
EntropyLossWeight=0.01,...
ActorOptimizerOptions=actorOpts,...
CriticOptimizerOptions=criticOpts,...
NumEpoch=3,...
AdvantageEstimateMethod="gae",...
GAEFactor=0.95,...
SampleTime=0.1,...
DiscountFactor=0.997);
agent = rlPPOAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions(...
MaxEpisodes=20000,...
MaxStepsPerEpisode=600,...
Plots="training-progress",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=430,...
ScoreAveragingWindowLength=100);
trainingStats = train(agent, simEnv, trainOpts);
Thanks in advance!

Answers (2)

arushi on 22 Aug 2024
Edited: arushi on 22 Aug 2024
Hi Lloyd,
Some potential reasons why your training might be hitting a plateau and not improving further:
Q0 and Learning Plateau:
  • Q0 is the critic's estimate of the discounted long-term reward at the start of each episode. If it is not progressing past a certain point, the likely causes are insufficient exploration or suboptimal hyperparameters.
Exploration vs. Exploitation:
  • Ensure your agent is exploring adequately. The entropy loss weight (EntropyLossWeight) in PPO helps encourage exploration by adding randomness to the policy. You might try increasing this value slightly to see if it helps the agent explore more diverse actions.
Learning Rates:
  • The learning rates for both the actor and critic (LearnRate=1e-4) might be too low or too high. Experiment with different learning rates, such as 1e-3 or 5e-5, to see if the agent's performance improves.
Clip Factor:
  • The clip factor (ClipFactor=0.02) controls how much the policy is allowed to change at each update. If it's too restrictive, the agent might not learn effectively. Try increasing it to 0.1 or 0.2 (see the combined options sketch after this list).
Reward Function:
  • Ensure your reward function is well-designed and provides sufficient feedback for the agent to learn effectively. If the reward is sparse or doesn't align well with the task objectives, the agent may struggle to learn.
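As a rough starting point, the agent options from the question could be adjusted along these lines. This is only an illustrative sketch: the exact values (learning rate 1e-3, clip factor 0.1, entropy loss weight 0.05) are example numbers to experiment with, not tuned settings, and everything else is kept as in the original code.
% Illustrative hyperparameter adjustments (example values, not tuned)
actorOpts  = rlOptimizerOptions(LearnRate=1e-3);   % was 1e-4
criticOpts = rlOptimizerOptions(LearnRate=1e-3);   % was 1e-4
agentOpts = rlPPOAgentOptions(...
    ExperienceHorizon=600,...
    ClipFactor=0.1,...             % was 0.02; allows larger policy updates
    EntropyLossWeight=0.05,...     % was 0.01; encourages more exploration
    ActorOptimizerOptions=actorOpts,...
    CriticOptimizerOptions=criticOpts,...
    NumEpoch=3,...
    AdvantageEstimateMethod="gae",...
    GAEFactor=0.95,...
    SampleTime=0.1,...
    DiscountFactor=0.997);
agent = rlPPOAgent(actor,critic,agentOpts);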
Hope this helps.

Kaustab Pal on 22 Aug 2024
Hi @Lloyd,
The yellow line, Q0, in the plot represents the estimate of the discounted long-term reward at the start of each episode, based on the initial observation of the environment. Ideally, as training progresses and if the critic is well-designed and learning effectively, the average Q0 should converge towards the actual discounted long-term reward (depicted by the dark-blue line).
In your case, it seems that around episode 2000, Q0 ceases to improve, indicating that the critic may have stopped learning. This is a common challenge in reinforcement learning. Here are a few suggestions to address this:
  1. Reward function: Ensure that your reward function effectively guides the agent towards the desired behavior. Consider normalizing the rewards before training your agent.
  2. Hyperparameter tuning: Experiment with different values for hyperparameters such as the learning rate, clip factor, and entropy loss weight.
  3. You might want to add more layers to your critic network to enhance its capacity to learn complex information. However, be cautious of overfitting when adding too many layers (a minimal sketch follows below).
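For example, here is a minimal sketch of a critic with one extra hidden layer. It uses MATLAB's default weight initialization rather than the custom initialization in the question, and the layer sizes are only illustrative.
% Minimal sketch: critic with one extra hidden layer (default initialization)
criticLayerSizes = [400 300 200];   % illustrative sizes
criticNetwork = [
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(criticLayerSizes(1))
    reluLayer
    fullyConnectedLayer(criticLayerSizes(2))
    reluLayer
    fullyConnectedLayer(criticLayerSizes(3))
    reluLayer
    fullyConnectedLayer(1)
    ];
criticNetwork = dlnetwork(criticNetwork);
critic = rlValueFunction(criticNetwork,obsInfo);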
For more information, you can refer to the Reinforcement Learning Toolbox documentation.
Hope this is helpful.
