Getting Started with Reinforcement Learning - MATLAB

      Getting Started with Reinforcement Learning

      Get started with reinforcement learning and Reinforcement Learning Toolbox™ by walking through an example that trains a quadruped robot to walk. This video covers the basics of reinforcement learning and gives you an idea of what it is like to work with Reinforcement Learning Toolbox. Reinforcement learning is a type of machine learning technique where a computer agent learns to perform a task through repeated trial-and-error interactions with a dynamic environment. Watch this video to learn how Reinforcement Learning Toolbox helps you:

      • Create a reinforcement learning environment in Simulink
      • Synthesize reward signals for training
      • Create neural network policies interactively or programmatically
      • Select and design the appropriate reinforcement learning agent
      • Train your agent and inspect training results
      • Generate C/C++ code for deploying the trained policy

      Published: 3 Feb 2022

      Reinforcement learning is a type of machine learning that determines which actions to take to achieve a specific goal. Consider this model of a quadruped robot. As you can see, this simple model has a torso and four legs. There are four hip and four knee revolute joints controlled by servo motors. How could you design a way to move the joints for the robot to walk? It would be difficult to coordinate that kind of movement using traditional control methods. With reinforcement learning as an alternative approach, we can use an algorithm that teaches itself through trial and error. In this video, you will see the basics of reinforcement learning in MATLAB and Simulink by getting this quadruped robot to walk. Let's get started.

      The core idea behind reinforcement learning is repeated experimentation. That means there needs to be room to make mistakes and figure out which movements will and won't work for walking. Normally you would start by developing a model to simulate your system. You may even have something ready to go that was previously developed, like we do here. If not, that's OK. With MATLAB, Simulink, and Simscape you have a lot of tools for making high-fidelity models of your system's dynamics.

      In reinforcement learning, the model and everything outside of the learning algorithm is called the environment. With our environment model in place, we need to choose a reinforcement learning algorithm to use. This is called an agent, and it will make decisions on how to actuate the joints based on measurements from the environment. After a lot of trial and error, the agent learns to make the right choices to achieve successful walking.

      If you navigate to Reinforcement Learning in the Library Browser, you'll find the RL Agent block. Let's add it to our Simulink model. Double-clicking the block opens its parameters, where you can see that it needs to reference an agent variable in the MATLAB workspace. There are many different agent representations available in Reinforcement Learning Toolbox to choose from. Before we pick one, we'll need to consider how the environment is modeled. For that, let's step back and look at the RL Agent block's inputs and outputs; then we'll be able to plug in the best choice.

      The agent makes its decisions based on observations. Observations give the agent information about the state of the environment: things like positions, velocities, forces, and any other signal we select based on what the agent needs to know to be able to walk. For the robot, that means signals like the vertical and lateral position of the torso, as well as its roll, pitch, and yaw rates, are included in the observation signal. Sensor outputs from the environment are used to calculate these values, and in total, 44 observation signals are provided to the agent. In MATLAB, we use the rlNumericSpec function to format the observation dimensions so that the agent knows to expect a 44-by-1 signal.
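
      As a minimal sketch, that observation specification might look like the following in MATLAB; only the 44-by-1 dimension comes from the video, and the variable and signal names are our own choices:

          % Hedged sketch: continuous observation specification for the agent
          obsInfo = rlNumericSpec([44 1]);  % 44 observation signals, per the video
          obsInfo.Name = 'observations';    % illustrative name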

      Based on the observations, the agent sends actions back to the environment. In this case, there are eight joints to manipulate, so our agent will send an action signal containing eight joint torques back to the quadruped robot model. Just like before, we'll use rlNumericSpec to format the action dimensions in MATLAB so that the agent knows it's an 8-by-1 signal. Both the observations and the actions for the quadruped robot are continuous signals: the movements are observed in 3D space, and the torques can vary continuously within the motors' upper and lower limits. With that information in hand, we can follow the roadmap in the documentation for selecting agents.
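
      A matching action specification, plus the call that wraps the Simulink model as an environment, might be sketched like this; the torque limits and the model and block names are placeholder assumptions, not values from the video:

          % Hedged sketch: continuous action specification for eight joint torques
          actInfo = rlNumericSpec([8 1], ...
              'LowerLimit', -1, 'UpperLimit', 1);  % placeholder torque limits
          actInfo.Name = 'torques';                % illustrative name
          % Wrap the Simulink model as an RL environment (names are assumed)
          env = rlSimulinkEnv('quadrupedModel', 'quadrupedModel/RL Agent', ...
              obsInfo, actInfo);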

      Let's start by trying a Deep Deterministic Policy Gradient, or DDPG, agent. This kind of agent relies on neural networks to learn to complete the task. MATLAB and Simulink have a number of tools available to help you construct your own networks for deep learning and reinforcement learning. You can also create default networks with the agent commands in MATLAB. Here we will use the networks generated by the createNetworks helper function.
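
      If you just want the default networks, one way to create a DDPG agent directly from the observation and action specifications is sketched below; the sample time is an assumed value, and the video instead uses networks built by its createNetworks helper:

          % Hedged sketch: DDPG agent with toolbox-generated default networks
          Ts = 0.025;                                        % assumed sample time
          agentOpts = rlDDPGAgentOptions('SampleTime', Ts);
          agent = rlDDPGAgent(obsInfo, actInfo, agentOpts);  % default networks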

      It's the agent's job to map observations to actions. During training, the agent adds noise to its action signals to promote exploration of the environment. Exploration is key in reinforcement learning, since it lets the agent visit new states and learn which ones are good or bad for walking. In reinforcement learning, the goal is defined via a reward function. The reward function provides positive feedback for desired behavior and penalizes undesirable outcomes. The agent learns by trying to maximize the reward it receives.
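
      For a DDPG agent, that exploration noise is configurable through the agent options; the values below are purely illustrative, and the property names assume a recent toolbox release:

          % Hedged sketch: tune the noise the agent adds to actions while training
          agentOpts.NoiseOptions.StandardDeviation = 0.1;            % illustrative
          agentOpts.NoiseOptions.StandardDeviationDecayRate = 1e-5;  % illustrative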

      The agent learns by measuring the current observations of the robot, taking an action to maximize the expected long-term reward and explore the system's state space, and then measuring the next observations of the robot. The reward signal indicates how good the transition from the current state to the next state was. By collecting batches of these experiences, the agent can estimate the long-term expected reward, which guides the learning of the policy so that the agent takes actions that achieve the walking behavior we desire.
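
      Concretely, the long-term expected reward here is the standard discounted return: with a discount factor gamma between 0 and 1, the return following time t is

          G(t) = r(t+1) + gamma*r(t+2) + gamma^2*r(t+3) + ...

      so rewards received sooner count more than rewards received later, and the policy is trained to maximize the expected value of this sum.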

      Let's start with a reward signal designed to maximize forward velocity; we could say that's the simplest goal of walking. An equation rewarding forward movement can be modeled with standard math operation blocks in Simulink. The last signal to connect to the agent block specifies conditions for stopping the simulation, things like the robot falling over or some part of it being below ground. When you train an RL agent, hundreds or thousands of simulations occur, so there's a lot of experience to learn from. Since not all of the simulations will end in walking, we want to be sure to cut the bad ones short based on termination conditions. With all these pieces in place, we're ready to train.
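
      In the video, both signals are built from Simulink blocks, but a MATLAB sketch of the same logic could look like the following; every variable name and threshold here is an illustrative assumption:

          % Hedged sketch: reward forward velocity; stop bad episodes early
          reward = vx;                       % forward velocity of the torso
          isDone = (zTorso < zMin) || ...    % some part of the robot below ground
                   (abs(roll) > rollMax);    % robot has fallen over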

      Back in MATLAB, we want to set up training options as well as some parameters for our agent. Each of these options alters the behavior of the agent during training; for the full list, refer to the documentation. While training, we can use the Episode Manager to get a feel for how things are going. Remember, the agent is trying to maximize the long-term reward that we defined based on what we know about the requirements for walking.
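
      A typical setup might be sketched as follows; the episode counts and stopping criterion are placeholder values rather than the ones used in the video:

          % Hedged sketch: training options and the call that starts training
          trainOpts = rlTrainingOptions( ...
              'MaxEpisodes', 2000, ...                   % placeholder
              'MaxStepsPerEpisode', 1000, ...            % placeholder
              'StopTrainingCriteria', 'AverageReward', ...
              'StopTrainingValue', 200, ...              % placeholder target
              'Plots', 'training-progress');             % opens the Episode Manager
          trainingStats = train(agent, env, trainOpts);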

      Once training is done, we can simulate the model to see how the agent performs. Over the course of training, the agent learns how to actuate the joints to move the robot and maximize the reward. Ideally, it does this by walking forward. But as you can see, the robot is not moving the way we want. Realistically, it's unlikely you'll get a perfect result the first time around; you'll have to iterate through some trial and error, just like the agent does.
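
      Running that simulation from MATLAB might look like this; the step limit is an assumed value:

          % Hedged sketch: run the trained agent against the environment
          simOpts = rlSimulationOptions('MaxSteps', 1000);  % assumed step limit
          experience = sim(env, agent, simOpts);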

      We can modify the reward signal to improve overall performance by encouraging or discouraging different behaviors. Let's add terms that reward keeping the torso parallel to the ground and at a desired height, discourage early termination, and penalize excessive effort in the form of high joint torques. Remember that the agent takes an action that affects the environment, and we calculate the reward from the new state of the environment. Now we're including the action in the reward, but that's the previous action, the one that led to this new state. That means we'll need to add a Delay block to get the action that got us here.
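
      As a rough sketch, the expanded reward might combine weighted terms like these; the weights and variable names are illustrative, and tauPrev stands for the delayed action described above:

          % Hedged sketch: multi-term reward (weights w1..w5 are illustrative)
          reward = w1*vx ...                     % reward forward velocity
                 - w2*(zTorso - zDesired)^2 ...  % keep torso at desired height
                 - w3*(roll^2 + pitch^2) ...     % keep torso level with the ground
                 + w4 ...                        % per-step bonus against early termination
                 - w5*sum(tauPrev.^2);           % penalize high joint torques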

      As you might have noticed, training takes a long time. Reinforcement learning is inherently time-consuming, since the agent needs to learn from all that accumulated experience. We can speed things up by distributing the computational load using Parallel Computing Toolbox. With these changes in place, we can train and simulate again. This time the agent learned what we hoped it would: the robot walks forward without falling over and doesn't show any bad habits or unexpected behaviors that interfere with normal walking.
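
      Enabling parallel training is mostly a matter of options, roughly as below; this assumes Parallel Computing Toolbox is installed, and asynchronous mode is one possible choice rather than necessarily the video's:

          % Hedged sketch: distribute episode simulations across parallel workers
          trainOpts.UseParallel = true;
          trainOpts.ParallelizationOptions.Mode = 'async';
          trainingStats = train(agent, env, trainOpts);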

      We've got a trained algorithm ready to walk, so now what do we do with it? From here, we could deploy it to hardware. MATLAB Coder and Simulink Coder let you generate C/C++ code to integrate with a real-world quadruped robot.
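
      One common path is sketched below: generatePolicyFunction writes an evaluatePolicy function for the trained agent, which MATLAB Coder can then compile to C/C++; the library configuration shown is one possible choice:

          % Hedged sketch: extract the trained policy and generate C/C++ code
          generatePolicyFunction(agent);  % writes evaluatePolicy.m and agentData.mat
          cfg = coder.config('lib');      % static library target (one option)
          codegen('evaluatePolicy', '-args', {ones(44,1)}, '-config', cfg);

      We've just set up a reinforcement learning agent that teaches itself how to get a quadruped robot to walk, and you can too. Now that you have a feel for what it's like to work with reinforcement learning in MATLAB and Simulink, it's time to learn it. The best way to learn reinforcement learning is to work with it, so start Reinforcement Learning Onramp, which will teach you the basics. It's free and takes just a couple of hours. Welcome to Reinforcement Learning with MATLAB and Simulink.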
