
Reinforcement Learning toolbox simple Q learning

9 views (last 30 days)
Hello everyone, I am a newbie in reinforcement learning and I am trying to use the MATLAB RL Toolbox to solve a simple problem.
However, I have run into a problem during the process. I followed the documentation to build my own environment (step function and reset function) and applied Q-learning to it. However, the result I get from the RL Toolbox is quite different from the one I get when I write the algorithm myself. I am running simple Q-learning with epsilon-greedy exploration for 1 episode of 350 steps.
I wonder if anyone knows what causes this difference. I post my driver script below; it has three sections: 1) import the environment, 2) Q-learning using the RL Toolbox, 3) Q-learning written by myself.
I have not uploaded my step function and reset function here.
Thank you so much for the help.
clear all
close all
clc
%%
%%section 1: Import Env and Create action/state
IsDone=0;
Observinfo = rlFiniteSetSpec([1 2 3 4]);
Observinfo.Name = 'Swimmer States';
Observinfo.Description = 'state representation';
ActionInfo = rlFiniteSetSpec([1 2]);
ActionInfo.Name = 'Link Action';
env = rlFunctionEnv(Observinfo,ActionInfo,'myStepFunction','myResetFunction');
%%
%%section 2:Q learning using RL toolbox
qTable = rlTable(getObservationInfo(env),getActionInfo(env));
qRepresentation = rlQValueRepresentation(qTable,getObservationInfo(env),getActionInfo(env));
qRepresentation.Options.LearnRate = 1;
agentOpts = rlQAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon = .05;
agentOpts.EpsilonGreedyExploration.EpsilonMin = .05;
agentOpts.DiscountFactor=0.7;
qAgent = rlQAgent(qRepresentation,agentOpts);
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 400;
trainOpts.MaxEpisodes= 30;
trainOpts.StopTrainingCriteria = "GlobalStepCount" ;
trainOpts.StopTrainingValue = 350 ;
trainOpts.ScoreAveragingWindowLength = 28;
trainingStats = train(qAgent,env,trainOpts);
critic = getCritic(qAgent);
qtable = getLearnableParameters(critic);
%%
%%section 3:Q learning by myself
InitialObs = reset(env)
Q=zeros(4,2);
alpha=1;
gamma=0.7;
epsilon=0.05;
iter=350;
for i = 1:iter
    randcheck = rand;
    s = env.LoggedSignals.State
    if randcheck > epsilon
        a = find(Q(s,:) == max(Q(s,:)));
    elseif randcheck <= epsilon
        a = randi(2);
    end
    [NextObs,Reward,IsDone,LoggedSignals] = step(env,a);
    s2 = LoggedSignals.State
    r = Reward;
    Q(s,a) = Q(s,a) + alpha*(r + gamma*max(Q(s2,:)) - Q(s,a));
end

Answers (1)

arushi on 21 Dec 2023
Hi Yangzhe,
I understand that you are getting different results from your two implementations of the same algorithm. Given the code you've provided, here are a few considerations that might explain the differences between the MATLAB RL toolbox run and your custom Q-learning loop:
Epsilon-Greedy Action Selection:
  • In your custom code, if multiple actions have the same maximum Q-value, the find function returns all indices where that condition holds. If there is more than one, a becomes a vector, so both step(env,a) and the update Q(s,a) behave unexpectedly, because the code does not explicitly choose a single action from the resulting array.
  • To fix this, pick one action at random from those tied for the maximum Q-value, for example with a = a(randi(length(a))).
  • Ensure that the MATLAB RL toolbox also breaks ties for the maximum Q-value the same way, so the two implementations explore consistently.
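A minimal sketch of that tie-breaking fix inside the custom loop (reusing the variable names from the posted script) might look like:

```matlab
% Epsilon-greedy action selection with explicit random tie-breaking
if randcheck > epsilon
    candidates = find(Q(s,:) == max(Q(s,:)));   % all greedy actions (may be several)
    a = candidates(randi(length(candidates)));  % pick exactly one at random
else
    a = randi(2);                               % explore uniformly over both actions
end
```

This guarantees a is always a scalar, which matters especially early in training when the Q-table is all zeros and every action is tied.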
Initial Observation:
  • The custom code does not seem to use InitialObs after resetting the environment. Make sure that you're starting from the same initial state in both implementations.
Learning Rate (Alpha):
  • You have set the learning rate to 1 in both implementations, which is fine as long as this is intentional. A learning rate of 1 means that the Q-values are updated to fully reflect the most recent information, without considering the old value.
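For reference, with a learning rate of 1 the tabular update in the posted loop collapses to simply overwriting the old estimate:

```matlab
% General tabular Q-learning update
Q(s,a) = Q(s,a) + alpha*(r + gamma*max(Q(s2,:)) - Q(s,a));
% With alpha = 1 this is equivalent to discarding the old value:
Q(s,a) = r + gamma*max(Q(s2,:));
```

If the toolbox representation uses any other effective learning rate, the two Q-tables will diverge even with identical action sequences.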
Stopping Criteria:
  • In the MATLAB RL toolbox code, you've set MaxStepsPerEpisode to 400 and MaxEpisodes to 30 but then stop after 350 global steps. This could potentially result in stopping mid-episode.
  • In your custom code, you run for a flat 350 iterations, which may not correspond to the same number of episodes or steps per episode as in the MATLAB RL toolbox code.
  • Ensure that the number of episodes and steps per episode is consistent between both implementations.
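One way to make the toolbox run comparable to the custom 350-iteration loop (a sketch, assuming a single long episode is what is intended and that the environment does not terminate earlier via IsDone) is:

```matlab
% Match the custom loop: one episode of 350 steps
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 350;               % same step budget as the loop
trainOpts.MaxEpisodes = 1;                        % single episode
trainOpts.StopTrainingCriteria = "EpisodeCount";  % stop after that one episode
trainOpts.StopTrainingValue = 1;
```

Note that the custom loop never resets the environment mid-run, whereas the toolbox resets at every episode boundary, so keeping everything inside one episode removes that source of divergence.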
Reward Processing:
  • Verify that the reward is being processed and applied in the same way in both implementations.
Hope these suggestions help.
Thank you

Release

R2021b
