
Reinforcement Learning toolbox simple Q learning

9 views (last 30 days)
Hello everyone, I am a newbie in reinforcement learning and I am trying to use the MATLAB RL Toolbox to solve a simple problem.
However, I have run into a problem during the process. I followed the documentation to build my own environment (step function and reset function) and applied Q-learning to it. However, the result I get from the RL Toolbox is quite different from the one I get when I write the algorithm myself. I am running simple Q-learning with epsilon-greedy exploration for 1 episode of 350 steps.
I wonder if anyone knows what causes this difference. I post my driver script below; it has three sections: 1) import the environment, 2) Q-learning using the RL Toolbox, 3) Q-learning written by myself.
I have not uploaded my step function and reset function here.
Thank you so much for the help.
clear all
close all
clc
%%
%%section 1: Import Env and Create action/state
IsDone=0;
Observinfo = rlFiniteSetSpec([1 2 3 4]);
Observinfo.Name = 'Swimmer States';
Observinfo.Description = 'state representation';
ActionInfo = rlFiniteSetSpec([1 2]);
ActionInfo.Name = 'Link Action';
env = rlFunctionEnv(Observinfo,ActionInfo,'myStepFunction','myResetFunction');
%%
%%section 2:Q learning using RL toolbox
qTable = rlTable(getObservationInfo(env),getActionInfo(env));
qRepresentation = rlQValueRepresentation(qTable,getObservationInfo(env),getActionInfo(env));
qRepresentation.Options.LearnRate = 1;
agentOpts = rlQAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon = .05;
agentOpts.EpsilonGreedyExploration.EpsilonMin = .05;
agentOpts.DiscountFactor=0.7;
qAgent = rlQAgent(qRepresentation,agentOpts);
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 400;
trainOpts.MaxEpisodes= 30;
trainOpts.StopTrainingCriteria = "GlobalStepCount" ;
trainOpts.StopTrainingValue = 350 ;
trainOpts.ScoreAveragingWindowLength = 28;
trainingStats = train(qAgent,env,trainOpts);
critic = getCritic(qAgent);
qtable = getLearnableParameters(critic);
%%
%%section 3:Q learning by myself
InitialObs = reset(env)
Q=zeros(4,2);
alpha=1;
gamma=0.7;
epsilon=0.05;
iter=350;
for i = 1:iter
    randcheck = rand;
    s = env.LoggedSignals.State
    if randcheck > epsilon
        a = find(Q(s,:) == max(Q(s,:)));
    elseif randcheck <= epsilon
        a = randi(2);
    end
    [NextObs,Reward,IsDone,LoggedSignals] = step(env,a);
    s2 = LoggedSignals.State
    r = Reward;
    Q(s,a) = Q(s,a) + alpha*(r + gamma*max(Q(s2,:)) - Q(s,a));
end

Answers (1)

arushi on 21 Dec 2023
Hi Yangzhe,
I understand that you are getting different results from your two implementations of the same algorithm. Given the code you've provided, here are a few considerations that might explain the differences between the MATLAB RL toolbox run and your custom Q-learning loop:
Epsilon-Greedy Action Selection:
  • In your custom code, if multiple actions have the same maximum Q-value, the find function returns all indices where that condition holds. If there is more than one, a becomes a vector, so both step(env,a) and the update Q(s,a) behave unexpectedly, because the code does not explicitly choose a single action from the resulting array.
  • To fix this, pick one action at random from those tied for the maximum Q-value, for example with a = a(randi(length(a))).
  • Ensure that the MATLAB RL toolbox also breaks ties for the maximum Q-value the same way, so the two implementations explore consistently.
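A minimal sketch of that tie-breaking fix inside the custom loop (reusing the variable names from the posted script) might look like:

```matlab
% Epsilon-greedy action selection with explicit random tie-breaking
if randcheck > epsilon
    candidates = find(Q(s,:) == max(Q(s,:)));   % all greedy actions (may be several)
    a = candidates(randi(length(candidates)));  % pick exactly one at random
else
    a = randi(2);                               % explore uniformly over both actions
end
```

This guarantees a is always a scalar, which matters especially early in training when the Q-table is all zeros and every action is tied.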
Initial Observation:
  • The custom code does not seem to use InitialObs after resetting the environment. Make sure that you're starting from the same initial state in both implementations.
Learning Rate (Alpha):
  • You have set the learning rate to 1 in both implementations, which is fine as long as this is intentional. A learning rate of 1 means that the Q-values are updated to fully reflect the most recent information, without considering the old value.
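For reference, with a learning rate of 1 the tabular update in the posted loop collapses to simply overwriting the old estimate:

```matlab
% General tabular Q-learning update
Q(s,a) = Q(s,a) + alpha*(r + gamma*max(Q(s2,:)) - Q(s,a));
% With alpha = 1 this is equivalent to discarding the old value:
Q(s,a) = r + gamma*max(Q(s2,:));
```

If the toolbox representation uses any other effective learning rate, the two Q-tables will diverge even with identical action sequences.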
Stopping Criteria:
  • In the MATLAB RL toolbox code, you've set MaxStepsPerEpisode to 400 and MaxEpisodes to 30 but then stop after 350 global steps. This could potentially result in stopping mid-episode.
  • In your custom code, you run for a flat 350 iterations, which may not correspond to the same number of episodes or steps per episode as in the MATLAB RL toolbox code.
  • Ensure that the number of episodes and steps per episode is consistent between both implementations.
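One way to make the toolbox run comparable to the custom 350-iteration loop (a sketch, assuming a single long episode is what is intended and that the environment does not terminate earlier via IsDone) is:

```matlab
% Match the custom loop: one episode of 350 steps
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 350;               % same step budget as the loop
trainOpts.MaxEpisodes = 1;                        % single episode
trainOpts.StopTrainingCriteria = "EpisodeCount";  % stop after that one episode
trainOpts.StopTrainingValue = 1;
```

Note that the custom loop never resets the environment mid-run, whereas the toolbox resets at every episode boundary, so keeping everything inside one episode removes that source of divergence.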
Reward Processing:
  • Verify that the reward is being processed and applied in the same way in both implementations.
Hope these suggestions help.
Thank you

Release

R2021b
