
sample

Sample experiences from replay memory buffer

    Description


    experience = sample(buffer,batchSize) returns a mini-batch of N experiences from the replay memory buffer, where N is specified using batchSize.

    experience = sample(buffer,batchSize,Name=Value) specifies additional sampling options using one or more name-value pair arguments.

    [experience,Mask] = sample(buffer,batchSize,Name=Value) also returns the sequence padding mask Mask, which indicates which entries of a sampled sequence contain real experiences and which entries are padding.

    Examples


    Define observation specifications for the environment. For this example, assume that the environment has a single observation channel with three continuous signals in specified ranges.

    obsInfo = rlNumericSpec([3 1],...
        LowerLimit=0,...
        UpperLimit=[1;5;10]);

    Define action specifications for the environment. For this example, assume that the environment has a single action channel with two continuous signals in specified ranges.

    actInfo = rlNumericSpec([2 1],...
        LowerLimit=0,...
        UpperLimit=[5;10]);

    Create an experience buffer with a maximum length of 20,000.

    buffer = rlReplayMemory(obsInfo,actInfo,20000);

    Append a single experience to the buffer using a structure. Each experience contains the following elements: current observation, action, next observation, reward, and is-done.

    For this example, create an experience with random observation, action, and reward values. Indicate that this experience is not a terminal condition by setting the IsDone value to 0.

    exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
    exp.Action = {actInfo.UpperLimit.*rand(2,1)};
    exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
    exp.Reward = 10*rand(1);
    exp.IsDone = 0;

    Append the experience to the buffer.

    append(buffer,exp);

    You can also append a batch of experiences to the experience buffer using a structure array. For this example, append a sequence of 100 random experiences, with the final experience representing a terminal condition.

    for i = 1:100
        expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
        expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
        expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
        expBatch(i).Reward = 10*rand(1);
        expBatch(i).IsDone = 0;
    end
    expBatch(100).IsDone = 1;
    
    append(buffer,expBatch);

    After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.

    miniBatch = sample(buffer,50);
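
    As a quick check of the returned format, you can inspect the sampled structure. The sizes noted in the comments assume the 3-by-1 observation and 2-by-1 action specifications defined above and the layout described under Output Arguments.

    size(miniBatch.Observation{1})   % 3-by-50: observation dimension by batch size
    size(miniBatch.Action{1})        % 2-by-50: action dimension by batch size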

    You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive experiences with a discount factor of 0.95.

    horizonSample = sample(buffer,1,...
        NStepHorizon=10,...
        DiscountFactor=0.95);

    The returned sample includes the following information.

    • Observation and Action are the observation and action from the first experience in the horizon.

    • NextObservation and IsDone are the next observation and termination signal from the final experience in the horizon.

    • Reward is the cumulative reward across the horizon using the specified discount factor.

    You can also sample a sequence of consecutive experiences. In this case, the structure fields contain arrays with values for all sampled experiences.

    sequenceSample = sample(buffer,1,...
        SequenceLength=20);
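
    When a sampled sequence is shorter than SequenceLength, the returned arrays are padded. To identify which steps hold real experiences, you can request the padding mask as a second output. This brief sketch uses the Mask output described under Output Arguments.

    [sequenceSample,sequenceMask] = sample(buffer,1,...
        SequenceLength=20);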

    Define observation specifications for the environment. For this example, assume that the environment has two observation channels: one channel with two continuous observations and one channel with a three-valued discrete observation.

    obsContinuous = rlNumericSpec([2 1],...
        LowerLimit=0,...
        UpperLimit=[1;5]);
    obsDiscrete = rlFiniteSetSpec([1 2 3]);
    obsInfo = [obsContinuous obsDiscrete];

    Define action specifications for the environment. For this example, assume that the environment has a single action channel with two continuous signals in specified ranges.

    actInfo = rlNumericSpec([2 1],...
        LowerLimit=0,...
        UpperLimit=[5;10]);

    Create an experience buffer with a maximum length of 5,000.

    buffer = rlReplayMemory(obsInfo,actInfo,5000);

    Append a sequence of 50 random experiences to the buffer.

    for i = 1:50
        exp(i).Observation = ...
            {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
        exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
        exp(i).NextObservation = ...
            {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
        exp(i).Reward = 10*rand(1);
        exp(i).IsDone = 0;
    end
    
    append(buffer,exp);

    After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.

    miniBatch = sample(buffer,10);
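
    Because this buffer has two observation channels, the Observation field of the sampled structure is a cell array with one element per channel. The sizes noted in the comments assume the specifications defined above and the layout described under Output Arguments.

    size(miniBatch.Observation{1})   % 2-by-10: continuous channel
    size(miniBatch.Observation{2})   % 1-by-10: discrete channel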

    Input Arguments


    Experience buffer, specified as an rlReplayMemory object.

    Batch size of experiences to sample, specified as a positive integer.

    If batchSize is greater than the current length of the buffer, then sample returns no experiences.

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: DiscountFactor=0.95

    Sequence length, specified as a positive integer. For each batch element, sample up to SequenceLength consecutive experiences. If a sampled experience has a nonzero IsDone value, stop the sequence at that experience.

    N-step horizon length, specified as a positive integer. For each batch element, sample up to NStepHorizon consecutive experiences. If a sampled experience has a nonzero IsDone value, stop the horizon at that experience. Return the following experience information based on the sampled horizon.

    • Observation and Action values from the first experience in the horizon

    • NextObservation and IsDone values from the final experience in the horizon.

    • Cumulative reward across the horizon using the specified discount factor, DiscountFactor.

    If an experience in the horizon has a nonzero IsDone value, the horizon ends at that experience, and the cumulative reward includes only the rewards collected up to and including that experience.

    Sampling an n-step horizon is not supported when sampling sequences. Therefore, if SequenceLength > 1, then NStepHorizon must be 1.
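
    As an illustration of this restriction, the following sketch contrasts a valid horizon call with a valid sequence call; combining SequenceLength > 1 with NStepHorizon > 1 is not supported.

    % Valid: N-step horizon sampling (single-step sequences).
    hSample = sample(buffer,1,NStepHorizon=10,DiscountFactor=0.95);
    % Valid: sequence sampling; NStepHorizon must remain 1.
    sSample = sample(buffer,1,SequenceLength=20);
    % Not supported: SequenceLength > 1 together with NStepHorizon > 1.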

    Discount factor, specified as a nonnegative scalar less than or equal to one. When you sample a horizon of experiences (NStepHorizon > 1), sample returns the cumulative reward R computed as follows.

    R = \sum_{i=1}^{N} \gamma^{i} R_i

    Here:

    • γ is the discount factor.

    • N is the sampled horizon length, which can be less than NStepHorizon.

    • R_i is the reward for the ith horizon step.

    DiscountFactor applies only when NStepHorizon is greater than one.
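
    As an illustrative sketch of this computation (the per-step reward values here are arbitrary and not taken from the buffer):

    % Cumulative reward over an N-step horizon with discount factor gamma,
    % following the formula above: R = sum over i of gamma^i * R_i.
    gamma = 0.95;
    r = [1.0 0.5 2.0];                 % example rewards for steps 1..N
    R = sum(gamma.^(1:numel(r)).*r);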

    Data source index, specified as one of the following (see the sketch after this list):

    • -1 — Sample from the experiences of all data sources.

    • Nonnegative integer — Sample from the experiences of only the data source specified by DataSourceID.
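
    For example, if the buffer contains experiences tagged with data source index 0 (for instance, experiences collected by one of several parallel workers), you can restrict sampling to that source. This is a minimal sketch; it assumes such experiences have already been appended.

    % Sample 32 experiences drawn only from data source 0.
    srcBatch = sample(buffer,32,DataSourceID=0);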

    Output Arguments


    Experience sampled from the buffer, returned as a structure with the following fields.

    Starting state, returned as a cell array with length equal to the number of observation specifications specified when creating the buffer. Each element of Observation contains a D_O-by-batchSize-by-SequenceLength array, where D_O is the dimension of the corresponding observation specification.

    Agent action from starting state, returned as a cell array with length equal to the number of action specifications specified when creating the buffer. Each element of Action contains a D_A-by-batchSize-by-SequenceLength array, where D_A is the dimension of the corresponding action specification.

    Reward value obtained by taking the specified action from the starting state, returned as a 1-by-1-by-SequenceLength array.

    Next state reached by taking the specified action from the starting state, returned as a cell array with the same format as Observation.

    Termination signal, returned as a 1-by-1-by-SequenceLength array of integers. Each element of IsDone has one of the following values.

    • 0 — This experience is not the end of an episode.

    • 1 — The episode terminated because the environment generated a termination signal.

    • 2 — The episode terminated by reaching the maximum episode length.

    Sequence padding mask, returned as a logical array with length equal to SequenceLength. When the sampled sequence length is less than SequenceLength, the data returned in experience is padded. Each element of Mask is true for a real experience and false for a padded experience.

    You can ignore Mask when SequenceLength is 1.
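
    For example, when sampling sequences you can use Mask to discard padded steps. This is a minimal sketch; it assumes experience and Mask were returned by a call such as [experience,Mask] = sample(buffer,1,SequenceLength=20).

    % Keep only the rewards from real (non-padded) steps of the sequence.
    realRewards = experience.Reward(:,:,Mask);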

    Version History

    Introduced in R2022a