DQN Control for Inverted Pendulum with Reinforcement Learning Toolbox - MATLAB

      Use the Deep Q-Network (DQN) algorithm in Reinforcement Learning Toolbox™ to 1) create the environment, 2) create a DQN agent, 3) customize the policy representation, 4) train the DQN agent, 5) verify the trained policy, and 6) deploy the trained policy with code generation.

      The provided pendulum environment has predefined observations, actions, and reward. The actions include five possible torque values, while the observations include a 50x50 grayscale image as well as the angular rate of the pendulum, and the reward is based on the distance from the desired upright position. See how the default DQN agent feature automatically constructs a neural network representation of the Q-function, which the DQN agent uses to approximate the long-term reward. Learn how to use the Deep Network Designer app to graphically customize the generated Q-function representation.

      See how you can visualize the pendulum behavior and logged data during training, and monitor training progress. After training is complete, verify the policy in simulation to decide if further training is necessary. If you are happy with the design, deploy the trained policy using automatic code generation.

      Published: 20 Sep 2023

      In this demo, you will learn how to design a controller that balances an inverted pendulum using the Reinforcement Learning Toolbox. To find a controller, we will use the Deep Q-Network algorithm, or DQN, to train a deep neural network with pendulum images.

      You will also see how to create this neural network with the Deep Network Designer app from Deep Learning Toolbox. It's important to note before we get started that this workflow can also be completed with the Reinforcement Learning Designer app, but for this video, we will focus on the programmatic implementation.

      For this example, the environment has been predefined in MATLAB and includes the pendulum dynamics, the actions, observations, and the reward. First, we load the predefined environment, then we can view some details about the dynamics, actions, and observations. For example, the observations include a 50 x 50 grayscale image of the pendulum and the angular rate, while the action set includes the five possible torque values that can be applied to the system.
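
      A minimal sketch of this step, assuming the image-based pendulum environment name used by the toolbox's predefined environments:

      % Load the predefined pendulum environment with image observations
      % (the environment name below is an assumption based on the toolbox's
      % predefined environment set).
      env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

      % Observation specs: a 50-by-50 grayscale image and the angular rate.
      % Action spec: five discrete torque values.
      obsInfo = getObservationInfo(env);
      actInfo = getActionInfo(env);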

      The short-term reward at each time step is given by the negative of the squared error from the zero-angle, zero-angular-rate configuration, minus a penalty term on the control effort. This expression can be viewed as a negative distance metric: the further away the system is from the upward configuration, the smaller the reward. The same is true for larger control effort.
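
      As a rough sketch (the specific coefficients below are assumptions, not stated in the video), the per-step reward has the form r_t = -(theta_t^2 + 0.1*thetadot_t^2 + 0.001*u_t^2), where theta is the angle from upright, thetadot is the angular rate, and u is the applied torque.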

      The DQN algorithm will try to maximize the expected long-term reward over multiple time steps to find a controller that optimally inverts the pendulum. For additional implementation details, you can explore the code for the predefined pendulum environment. The next step is to define the structure of the controller.

      In reinforcement learning, the controller is referred to as the control policy, and the expected long-term reward of taking an action from a given state is given by the so-called Q-value function. The DQN algorithm approximates the Q-value function using a deep neural network. This approximation becomes more accurate through training, where the neural network parameters are updated. By estimating the goodness of actions, the neural network plays the role of a critic.
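
      In general terms, DQN updates the critic by regressing it toward the one-step target y = r + gamma * max over a' of Q(s', a'), where gamma is the discount factor and the maximization is evaluated with a slowly updated target network; this is standard DQN background rather than a detail shown in the video.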

      After training, the critic can evaluate actions based on their long-term reward and can be used to extract the optimal control policy. The critic network in this example has two inputs and one output. The inputs are the pendulum image and angular rate, and the output is the estimated Q-value in vectorized format. This means that for each of the five possible actions, a Q-value is approximated.

      Notice that while the pendulum angle is included in the reward function, it's not used as an input to the critic network. As a result, the control policy will not rely on angle measurements to complete the task. The next step is to create our DQN agent. The parameters of the critic and agent can be selected as needed depending on the problem.

      Once we have defined the critic and agent options, we can create a DQN agent using built-in functions in two ways. We can either provide a critic network to the DQN agent object, which we will show in a bit, or we can use the default agent approach. Notice that the only arguments we need to create the default agent are the observation info, the action info, and the agent options.
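
      A minimal sketch of the default-agent route (the option values below are illustrative assumptions; see rlDQNAgentOptions for the full set):

      % Agent options (values here are assumptions, not taken from the video).
      agentOpts = rlDQNAgentOptions( ...
          SampleTime=0.05, ...
          DiscountFactor=0.99, ...
          MiniBatchSize=64);

      % Default agent: a default critic network is generated automatically
      % from the observation and action specifications.
      agent = rlDQNAgent(obsInfo, actInfo, agentOpts);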

      This approach can save you time and effort by generating a default neural network critic under the hood. The generated critic is a good place to start but does not guarantee successful training. So let's see how we can customize it. To customize our critic network, we will use the Deep Network Designer app.

      Using the layers available in the left panel and specifying the properties for each layer on the right, we can update the image and angular-rate paths of the network and then connect these in a common path that outputs the vectorized Q-value. If needed, we can generate the equivalent MATLAB code that builds this network. Alternatively, we can create this network programmatically using MATLAB functions for the various layers, as sketched below.
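
      The following is a hedged sketch of such a programmatic construction; the layer sizes and layer names are illustrative assumptions, not the exact network from the video:

      % Image path: 50x50x1 grayscale input, ending in one Q-value per action.
      imgPath = [
          imageInputLayer([50 50 1], Normalization="none", Name="pendImage")
          convolution2dLayer(10, 2, Stride=5, Name="conv")
          reluLayer(Name="relu1")
          fullyConnectedLayer(400, Name="fc1")
          reluLayer(Name="relu2")
          fullyConnectedLayer(300, Name="fc2")
          additionLayer(2, Name="add")          % merge point for the two paths
          reluLayer(Name="relu3")
          fullyConnectedLayer(5, Name="qvals")];  % five torque actions

      % Angular-rate path: scalar input joined into the common path.
      dthetaPath = [
          featureInputLayer(1, Name="dtheta")
          fullyConnectedLayer(300, Name="fc3")];

      net = layerGraph(imgPath);
      net = addLayers(net, dthetaPath);
      net = connectLayers(net, "fc3", "add/in2");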

      Once the network is complete, we can export it to a workspace variable and use it to represent the critic. With the critic representation in hand, we can update our agent that will use the DQN algorithm to train the critic network. The parameters of the agent remain the same.
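
      Under the same naming assumptions as the sketch above, wiring the exported network into a critic and recreating the agent might look like this:

      % Vector Q-value critic: one Q-value per discrete action. The input
      % names must match the network's input layer names (assumed above).
      critic = rlVectorQValueFunction(dlnetwork(net), obsInfo, actInfo, ...
          ObservationInputNames=["pendImage","dtheta"]);

      % Recreate the DQN agent around the custom critic, reusing the options.
      agent = rlDQNAgent(critic, agentOpts);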

      We are now ready to train the critic. The training process involves running simulations to generate data points that will be used to update the parameters of the Deep Q Network. First, we need to specify the training options. For example, here we specify that we want to run the training for at most 5,000 episodes and stop training if the average reward exceeds the provided value.
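
      A sketch of the training options; apart from the 5,000-episode limit mentioned above, the values below are assumptions:

      trainOpts = rlTrainingOptions( ...
          MaxEpisodes=5000, ...                     % at most 5,000 episodes
          MaxStepsPerEpisode=400, ...               % assumed episode length
          StopTrainingCriteria="AverageReward", ...
          StopTrainingValue=-1000);                 % assumed reward threshold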

      New in R2023b, you can also define a custom criterion for stopping training. Also new in R2023b is the ability to automatically evaluate the agent at fixed intervals during training. If we are concerned about convergence or would like to better understand our agent, we can also create a logger to collect agent data during training.
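
      A hedged sketch of attaching periodic evaluation and data logging to training; the exact syntax is an assumption based on recent toolbox releases:

      evaluator = rlEvaluator(EvaluationFrequency=25, NumEpisodes=1);  % evaluate every 25 episodes
      logger    = rlDataLogger();   % file logger; attach callbacks to log data such as the critic loss

      trainResults = train(agent, env, trainOpts, ...
          Evaluator=evaluator, Logger=logger);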

      Throughout training, we can visualize the behavior of the pendulum in each episode. We can also monitor the training progress results using the episode manager. We can stop training at any time and continue from where we left off. Depending on the problem, training can take from minutes to several hours to complete.

      Using Parallel Computing Toolbox, we can speed up the training process with GPUs, multicore CPUs, and computer clusters by setting the corresponding training options. During training, the episode manager will display a red star with the results of the evaluation every 25 episodes, since that is the frequency we set. We can also go into the Data Viewer to take a look at our logged data, in this case the critic loss.
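
      For example, parallel simulation and GPU-based critic updates can be enabled roughly as follows (a sketch, assuming the critic and training options defined earlier):

      trainOpts.UseParallel = true;   % distribute episode simulations across workers
      critic.UseDevice = "gpu";       % evaluate and update the critic on a GPU (set before creating the agent)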

      Once training is complete, we can close the episode manager. We can reopen it at any time with the inspectTrainingResult command. To test our results, we can simulate and verify the control policy extracted from the trained critic network. If necessary, the critic can be trained further, picking up from where the previous training session ended. Once the agent is trained, we can deploy it by generating code. Using MATLAB Coder or GPU Coder, we can generate C, C++, or CUDA code for deployment of the agent.
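
      A minimal sketch of the verification and deployment steps (the simulation length is an assumption):

      % Verify the trained policy in simulation.
      simOpts = rlSimulationOptions(MaxSteps=500);
      experience = sim(env, agent, simOpts);

      % Generate a standalone policy evaluation function, which can then be
      % compiled with MATLAB Coder or GPU Coder.
      generatePolicyFunction(agent);   % creates evaluatePolicy.m and agentData.mat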

      Thanks for watching. I hope you now better understand how the reinforcement learning workflow for creating a controller can be accomplished with MathWorks products. For more information, please refer to the Reinforcement Learning Toolbox product page.
