How to avoid repeated actions and manually end an episode for a DQN agent?

I'm using the Reinforcement Learning Toolbox to design and train a DQN agent. At each time step, the agent's task is to select a location on a grid map to move to in order to map the environment. The action space is discrete and composed of 24 actions, i.e., possible target points.
The goal is to map 85% of the environment. The optimal behaviour for the agent would be to select a nearby point, move to that point, and then repeat this scheme at each subsequent time step until the goal is achieved.
The problem I'm facing is that during training the agent explores different sequences of actions in each episode, among which there are very good ones. As the agent becomes greedy, it correctly performs the first action at the first time step and then repeats that same action in a loop for the remaining time steps until the end of the episode, failing to complete the mission. It seems like it does not learn a sequence of actions, as if the algorithm were designed to make the agent achieve its goal in the fewest possible steps. Am I missing something? Is there some parameter tuning that could improve this behaviour?
Moreover, I would like to ask how I can end an episode. I implemented a custom step function and I've seen that switching the 'IsDone' flag to true ends the episode, but it also means that the agent has reached the target. What if I want to end the episode when the agent performs an action that in reality would end the episode without completing the mission, i.e. without setting the IsDone flag to true?
The agent is a DQN agent, and the critic and agent parameters are the defaults. The neural network architecture is the dueling DQN architecture from the original paper.
Thanks in advance for your help!

Answers (1)

Emmanouil Tzorakoleftherakis
From what you are saying, it seems that training has not converged yet. During training, the agent may behave very well in an episode every now and then, but unless this behavior is consistent across multiple back-to-back episodes (i.e., a high average reward), that is not a sign of convergence. I would try getting the agent to explore more by reducing the epsilon decay rate and raising the epsilon minimum value. There could be other things going on as well, the most important being a reward signal that does not accurately describe the desired behavior.
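As a sketch of where those settings live, the epsilon-greedy parameters can be set through rlDQNAgentOptions before creating the agent (the values below are illustrative placeholders, not tuned recommendations):

% Slow down the epsilon decay and raise the exploration floor so the
% agent keeps exploring for longer. Values are illustrative only.
agentOpts = rlDQNAgentOptions;

% Epsilon is multiplied by (1 - EpsilonDecay) after each step, so a
% smaller decay rate keeps epsilon high for more of the training run.
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-4;  % default 0.005

% A higher minimum guarantees some random actions even late in training.
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;     % default 0.01

agent = rlDQNAgent(critic, agentOpts);  % 'critic' is your existing critic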
For your second question, I don't see how that prevents you from using the IsDone flag. Just put an OR condition and set IsDone to true when the target is reached OR when the agent picks a certain action.
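For reference, a minimal sketch of what that could look like inside a custom step function (the signature matches what rlFunctionEnv expects; CoverageFraction and the forbidden action indices are hypothetical placeholders for your own state and logic):

function [nextObs, reward, isDone, loggedSignals] = myStepFunction(action, loggedSignals)
% Sketch: end the episode either on success or on a disallowed action.
% CoverageFraction and forbiddenActions are hypothetical placeholders.

    % ... update the map and compute nextObs and reward here ...
    nextObs = loggedSignals.State;    % placeholder observation
    reward  = 0;                      % placeholder reward

    % Success: 85% of the environment has been mapped.
    targetReached = loggedSignals.CoverageFraction >= 0.85;

    % Failure: the agent picked an action that should abort the episode.
    forbiddenActions = [3 7 12];      % hypothetical action indices
    badAction = ismember(action, forbiddenActions);

    % Penalize the failure case through the reward so the agent learns
    % that this termination was undesirable.
    if badAction
        reward = reward - 10;         % illustrative penalty
    end

    % IsDone ends the episode in either case.
    isDone = targetReached || badAction;
end

Note that IsDone by itself does not mean "success" to the agent; what distinguishes reaching the target from an aborted episode is the reward returned on that final step.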

Release

R2020b
