
rlReplayMemory

Replay memory experience buffer

Since R2022a

    Description

    An rlReplayMemory object is a circular experience buffer in which an off-policy reinforcement learning agent stores its experiences.

    During training, the agent stores each of its experiences (S,A,R,S',D) in the buffer. Here:

    • S is the current observation of the environment.

    • A is the action taken by the agent.

    • R is the reward for taking action A.

    • S' is the next observation after taking action A.

    • D is the is-done signal after taking action A.

    The agent then samples mini-batches of experiences from the buffer and uses these mini-batches to update its actor and critic function approximators.
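    In MATLAB code, an experience maps naturally to a structure with one field per element of the tuple. The following minimal sketch uses placeholder values; the field names match those used by the append examples later on this page.

    exp.Observation     = {rand(3,1)};   % S: current observation (one cell per observation channel)
    exp.Action          = {rand(2,1)};   % A: action taken by the agent (one cell per action channel)
    exp.Reward          = 1;             % R: scalar reward
    exp.NextObservation = {rand(3,1)};   % S': next observation
    exp.IsDone          = 0;             % D: is-done signal (0 indicates a nonterminal step)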

    By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object as their experience buffer. Agents uniformly sample data from this buffer.

    You can replace the default experience buffer with an alternative buffer object, such as an rlPrioritizedReplayMemory object.
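    For instance, the following sketch replaces the default buffer of a DQN agent with a prioritized buffer. The specification values are hypothetical, and the sketch assumes that rlPrioritizedReplayMemory accepts the same constructor arguments as rlReplayMemory.

    % Hypothetical specifications and a default DQN agent.
    obsInfo = rlNumericSpec([4 1]);
    actInfo = rlFiniteSetSpec([-1 1]);
    agent = rlDQNAgent(obsInfo,actInfo);

    % Swap the default uniform buffer for a prioritized buffer
    % (constructor arguments assumed to mirror rlReplayMemory).
    agent.ExperienceBuffer = rlPrioritizedReplayMemory(obsInfo,actInfo,20000);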

    When you create a custom off-policy reinforcement learning agent, you can create an experience buffer using an rlReplayMemory object.

    Creation

    Description

    buffer = rlReplayMemory(obsInfo,actInfo) creates a replay memory experience buffer that is compatible with the observation and action specifications in obsInfo and actInfo, respectively.

    buffer = rlReplayMemory(obsInfo,actInfo,maxLength) sets the maximum length of the buffer by setting the MaxLength property.
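    As a quick illustration of both syntaxes, the following sketch creates two buffers from hypothetical specification objects. For complete workflows, see the Examples section.

    obsInfo = rlNumericSpec([3 1]);
    actInfo = rlNumericSpec([2 1]);
    buffer1 = rlReplayMemory(obsInfo,actInfo);       % default maximum length
    buffer2 = rlReplayMemory(obsInfo,actInfo,1e5);   % maximum length of 100,000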


    Input Arguments


    obsInfo - Observation specifications

    Observation specifications, specified as a reinforcement learning specification object or an array of specification objects defining properties such as dimensions, data types, and names of the observation signals.

    You can extract the observation specifications from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

    actInfo - Action specifications

    Action specifications, specified as a reinforcement learning specification object defining properties such as dimensions, data types, and names of the action signals.

    You can extract the action specifications from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
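    For example, the following sketch obtains specifications both ways, using one of the predefined environments that appears later on this page and hypothetical manually constructed specifications.

    % Extract specifications from an existing environment.
    env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    % Or construct specifications manually (hypothetical values).
    manualObsInfo = rlNumericSpec([3 1]);
    manualActInfo = rlFiniteSetSpec([1 2 3]);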

    Properties


    MaxLength - Maximum buffer length

    This property is read-only.

    Maximum buffer length, specified as a nonnegative integer.

    To change the maximum buffer length, use the resize function.
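    For example, the following sketch grows a buffer created from hypothetical specifications. See also the resize example later on this page.

    buffer = rlReplayMemory(rlNumericSpec([3 1]),rlNumericSpec([1 1]),1000);
    resize(buffer,50000)   % MaxLength is now 50000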

    Length - Number of experiences in buffer

    This property is read-only.

    Number of experiences in buffer, specified as a nonnegative integer.

    Object Functions

    append             - Append experiences to replay memory buffer
    sample             - Sample experiences from replay memory buffer
    resize             - Resize replay memory experience buffer
    reset              - Reset environment, agent, experience buffer, or policy object
    allExperiences     - Return all experiences in replay memory buffer
    validateExperience - Validate experiences for replay memory
    getActionInfo      - Obtain action data specifications from reinforcement learning environment, agent, or experience buffer
    getObservationInfo - Obtain observation data specifications from reinforcement learning environment, agent, or experience buffer

    Examples


    Create Experience Buffer

    Define observation specifications for the environment. For this example, assume that the environment has a single observation channel with three continuous signals in specified ranges.

    obsInfo = rlNumericSpec([3 1],...
        LowerLimit=0,...
        UpperLimit=[1;5;10]);

    Define action specifications for the environment. For this example, assume that the environment has a single action channel with two continuous signals in specified ranges.

    actInfo = rlNumericSpec([2 1],...
        LowerLimit=0,...
        UpperLimit=[5;10]);

    Create an experience buffer with a maximum length of 20,000.

    buffer = rlReplayMemory(obsInfo,actInfo,20000);

    Append a single experience to the buffer using a structure. Each experience contains the following elements: current observation, action, reward, next observation, and is-done signal.

    For this example, create an experience with random observation, action, and reward values. Indicate that this experience is not a terminal condition by setting the IsDone value to 0.

    exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
    exp.Action = {actInfo.UpperLimit.*rand(2,1)};
    exp.Reward = 10*rand(1);
    exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
    exp.IsDone = 0;

    Before appending experience to the buffer, you can validate whether the experience is compatible with the buffer. The validateExperience function generates an error if the experience is incompatible with the buffer.

    validateExperience(buffer,exp)

    Append the experience to the buffer.

    append(buffer,exp);

    You can also append a batch of experiences to the experience buffer using a structure array. For this example, append a sequence of 100 random experiences, with the final experience representing a terminal condition.

    for i = 1:100
        expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
        expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
        expBatch(i).Reward = 10*rand(1);
        expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
        expBatch(i).IsDone = 0;
    end
    expBatch(100).IsDone = 1;
    
    validateExperience(buffer,expBatch)
    
    append(buffer,expBatch);

    After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.

    miniBatch = sample(buffer,50);
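    To check what the sample contains, you can inspect the returned structure; this brief sketch just lists its fields, which match those of the appended experiences.

    disp(fieldnames(miniBatch))   % experience fields such as Observation, Action, Reward, NextObservation, and IsDone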

    You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive experiences with a discount factor of 0.95.

    horizonSample = sample(buffer,1,...
        NStepHorizon=10,...
        DiscountFactor=0.95);

    The returned sample includes the following information.

    • Observation and Action are the observation and action from the first experience in the horizon.

    • NextObservation and IsDone are the next observation and termination signal from the final experience in the horizon.

    • Reward is the cumulative reward across the horizon using the specified discount factor, as shown in the sketch below.
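    That is, the horizon reward is a discounted sum of the individual rewards. A minimal sketch of the computation with hypothetical per-step rewards:

    discountFactor = 0.95;
    R = 10*rand(10,1);                                    % hypothetical rewards for a 10-step horizon
    cumulativeReward = sum(discountFactor.^(0:9)' .* R);  % R(1) + gamma*R(2) + ... + gamma^9*R(10)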

    You can also sample a sequence of consecutive experiences. In this case, the structure fields contain arrays with values for all sampled experiences.

    sequenceSample = sample(buffer,1,...
        SequenceLength=20);

    Create Experience Buffer With Multiple Observation Channels

    Define observation specifications for the environment. For this example, assume that the environment has two observation channels: one channel with two continuous observations and one channel with a three-valued discrete observation.

    obsContinuous = rlNumericSpec([2 1],...
        LowerLimit=0,...
        UpperLimit=[1;5]);
    obsDiscrete = rlFiniteSetSpec([1 2 3]);
    obsInfo = [obsContinuous obsDiscrete];

    Define action specifications for the environment. For this example, assume that the environment has a single action channel with two continuous signals in specified ranges.

    actInfo = rlNumericSpec([2 1],...
        LowerLimit=0,...
        UpperLimit=[5;10]);

    Create an experience buffer with a maximum length of 5,000.

    buffer = rlReplayMemory(obsInfo,actInfo,5000);

    Append a sequence of 50 random experiences to the buffer.

    for i = 1:50
        exp(i).Observation = ...
            {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
        exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
        exp(i).NextObservation = ...
            {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
        exp(i).Reward = 10*rand(1);
        exp(i).IsDone = 0;
    end
    
    append(buffer,exp);

    After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.

    miniBatch = sample(buffer,10);

    Resize Experience Buffer of DQN Agent

    Create an environment for training the agent. For this example, load a predefined environment.

    env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

    Extract the observation and action specifications from the environment.

    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    Create a DQN agent from the environment specifications.

    agent = rlDQNAgent(obsInfo,actInfo);

    By default, the agent uses an experience buffer with a maximum size of 10,000.

    agent.ExperienceBuffer
    ans = 
      rlReplayMemory with properties:
    
        MaxLength: 10000
           Length: 0
    
    

    Increase the maximum size of the experience buffer to 20,000.

    resize(agent.ExperienceBuffer,20000)

    View the updated experience buffer.

    agent.ExperienceBuffer
    ans = 
      rlReplayMemory with properties:
    
        MaxLength: 20000
           Length: 0
    
    

    Replace Experience Buffer of DQN Agent

    Create an environment for training the agent. For this example, load a predefined environment.

    env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

    Extract the observation and action specifications from the environment.

    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    Create a DQN agent from the environment specifications.

    agent = rlDQNAgent(obsInfo,actInfo);

    Display the default experience buffer.

    agent.ExperienceBuffer
    ans = 
      rlReplayMemory with properties:
    
        MaxLength: 10000
           Length: 0
    
    

    Create a new experience buffer.

    new_buffer = rlReplayMemory(obsInfo,actInfo,20000);

    Replace the experience buffer in the agent.

    agent.ExperienceBuffer = new_buffer;

    Display the new experience buffer.

    agent.ExperienceBuffer
    ans = 
      rlReplayMemory with properties:
    
        MaxLength: 20000
           Length: 0
    
    

    Display the dimensions of the observation channels.

    obsInfo.Dimension
    ans = 1×2
    
        50    50
    
    
    ans = 1×2
    
         1     1
    
    

    Check the agent using a random input observation.

    getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
    ans = 1x1 cell array
        {[1]}
    
    

    Version History

    Introduced in R2022a