What exactly is Episode Q0? What information is it giving?

Reading the documentation, I find that "For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, if the critic is well designed, Episode Q0 approaches the true discounted long-term reward."
But I cannot grasp exactly what Q0 is because, except in a few examples (like this one) where it "converges" to some value rather quickly, I have seen the Q0 value do different things that I cannot understand or interpret (like the two examples shown here). I also don't understand what "true discounted reward" means exactly. Is it per episode, an average, or something cumulative?
In this answer it is suggested that Q0 should track the average episode reward, but I don't see that in the examples.
For example, in the cartpole example, if one continues training for more episodes (changing the stop-training criterion so that it does not stop on average reward), Q0 reaches very high values that have nothing to do with the average reward or the episode rewards. I simulated 1000 episodes of the cartpole example, and the Q0 values even distort the plot scale because they go far too high. The agent seems to learn properly, and it even manages to get out of some local minima successfully, but I still cannot grasp what information Q0 yields.
I have not found Q0 defined in the reinforcement learning literature either. Could you please clarify a bit, or point me to some references where I can read further on this specific parameter?

Accepted Answer

Emmanouil Tzorakoleftherakis
Q0 is calculated by performing inference on the critic at the beginning of each episode. Effectively, it is a metric that tells you how well the critic has been trained. If you had the perfect critic that could accurately predict the expected long term reward based on the current observation at the beginning of the episode, this value should overlap with the actual total reward collected during that same episode.
In general, it is not required for this to happen for actor-critic methods. The actor may converge first, and at that point it would be totally fine to stop training.
Hope that helps
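To make the "true discounted long-term reward" concrete, here is a minimal Python sketch (it is not Reinforcement Learning Toolbox code). It assumes a hypothetical episode that earns a reward of +1 per step for 500 steps and a discount factor of 0.99. Both numbers and the function name `discounted_return` are illustrative assumptions, not values taken from the cartpole example. The sketch computes the discounted return from the first step, which is the quantity a perfect critic would report as Q0, and compares it with the plain undiscounted sum of the episode's rewards.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], the discounted return from step 0."""
    total = 0.0
    # Accumulate backwards so each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: +1 reward per step for 500 steps (illustrative only).
rewards = [1.0] * 500
gamma = 0.99  # assumed discount factor

episode_reward = sum(rewards)                # undiscounted total
true_q0 = discounted_return(rewards, gamma)  # what a perfect critic would predict

print(f"undiscounted episode reward: {episode_reward:.1f}")
print(f"discounted return from step 0: {true_q0:.2f}")
```

With a discount factor below 1, the two numbers differ, so a well-trained critic's Q0 can sit well below the undiscounted episode reward. In this hypothetical setup the discounted return can never exceed 1/(1 - gamma) = 100, so a Q0 far above that bound would suggest an overestimating critic rather than a better policy.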
  8 Comments
轩
on 30 Dec 2023
I think Q0 is still useful in actor-critic methods. Can I take it this way: in an actor-critic algorithm, if Q0 does not converge to the average reward, the critic is converging more slowly than the actor, so we should set a bigger critic learning rate or make a similar adjustment?
Jayandi
9 minutes ago
Thank you, Emmanouil. I think your definition is correct.
Convergence of Q0 and the average reward is the best a training run can achieve, but in my experience it is not necessary. These are my conclusions:
  1. Q0 should be almost stable, meaning it settles around one number, for instance -7.xxxx. It may not be stable at the beginning, but after several epochs it should settle. A Q0 that keeps moving up and down without settling probably calls for adjustments to the hyperparameters, the networks, or something else.
  2. In my case, a stable Q0 and a stable average reward are enough to say that training has converged. Testing the agent afterwards confirmed this.
  3. Of course, it is always tempting to keep making changes until the average reward and Q0 converge to each other, but I have had no luck yet. If anyone can share a recipe, I would be glad to hear it.
  4. In short, Q0 helps to check the training conditions.
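The "stable Q0" check in point 1 can be sketched as follows. This is a minimal Python illustration, not a Reinforcement Learning Toolbox feature: the function name `is_stable`, the window length, and the tolerance are all hypothetical choices, and the Q0 values are made up for the example. It treats a series as stable when the spread (maximum minus minimum) of its most recent values stays within a tolerance.

```python
def is_stable(values, window=20, tol=0.5):
    """Return True if the last `window` values span at most `tol`.

    Hypothetical helper: `window` and `tol` are arbitrary illustrative defaults.
    """
    if len(values) < window:
        return False  # not enough episodes yet to judge
    recent = values[-window:]
    return max(recent) - min(recent) <= tol


# Made-up Q0 logs (one value per episode), for illustration only.
settling_q0 = [-20.0 + i for i in range(13)] + [-7.1, -7.3, -7.0, -7.2] * 5
drifting_q0 = [float(i) for i in range(40)]

print(is_stable(settling_q0))
print(is_stable(drifting_q0))
```

The same check could be applied to the average reward, so that points 1 and 2 become a single test that both series have settled. Suitable values for the window and tolerance depend on the reward scale of the environment.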



Release

R2021a
