What exactly is Episode Q0? What information is it giving?
Cecilia S. on 11 Jun 2021
Reading the documentation, I find that "For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, if the critic is well designed, Episode Q0 approaches the true discounted long-term reward."
But I cannot grasp exactly what Q0 is because, except in a few examples (like this one) where it "converges" to some value rather quickly, I have seen the Q0 value do different things that I cannot understand or interpret (like the two examples shown here). I also don't understand what "true discounted reward" means exactly. Is it per episode, an average, or something cumulative?
In this answer it is suggested that Q0 should track the average episode reward, but I don't see that in the examples.
For example, in the cartpole example, if one continues the training for more episodes (changing the stop-training criteria so that training does not stop on average reward), Q0 reaches very high values that have nothing to do with the average reward or the episode rewards. I simulated 1000 episodes for the cartpole example, and the Q0 values even distort the plot scale because they go far too high. The agent seems to learn properly and it even manages to get out of some local minima successfully, but I still cannot grasp what information Q0 yields.
I have not found Q0 defined in the Reinforcement Learning literature either. Could you please clarify a bit or point me to some references where I can read further on this specific parameter?
Emmanouil Tzorakoleftherakis on 22 Jun 2021
Q0 is calculated by performing inference on the critic at the beginning of each episode. Effectively, it is a metric that tells you how well the critic has been trained. If you had the perfect critic that could accurately predict the expected long term reward based on the current observation at the beginning of the episode, this value should overlap with the actual total reward collected during that same episode.
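To make that comparison concrete, here is a minimal Python sketch (not Reinforcement Learning Toolbox code) of the quantity a perfect critic would be predicting at the start of an episode, using the "discounted long-term reward" wording from the documentation quoted in the question. The reward sequence, the discount factor `gamma`, and the helper name `discounted_return` are all illustrative assumptions and are not taken from the cartpole example or any other shipped example.

```python
def discounted_return(rewards, gamma):
    """Discounted sum r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    g = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: a reward of +1 at each of 500 steps (assumed values).
rewards = [1.0] * 500
gamma = 0.99  # assumed discount factor

g0 = discounted_return(rewards, gamma)   # target an ideal critic's Q0 would match
undiscounted = sum(rewards)              # plain total episode reward

# For a constant reward of 1, the discounted sum is a finite geometric series.
closed_form = (1 - gamma ** len(rewards)) / (1 - gamma)

print(f"discounted return from the initial state: {g0:.3f}")
print(f"undiscounted episode reward:              {undiscounted:.1f}")
```

Under these assumptions the discounted return is noticeably smaller than the undiscounted sum, so when the discount factor is below 1 the two curves need not sit exactly on top of each other even for an accurate critic. A Q0 that grows far beyond any return the episodes could produce would, on this reading, point to the critic overestimating rather than to the reward itself.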
In general, this is not required for actor-critic methods. The actor may converge first, and at that point it would be totally fine to stop training.
Hope that helps