
sample

Sample experiences from replay memory buffer

    Description


    experience = sample(buffer,batchSize) returns a mini-batch of N experiences from the replay memory buffer, where N is specified using batchSize.

    experience = sample(buffer,batchSize,Name=Value) specifies additional sampling options using one or more name-value pair arguments.

    [experience,Mask] = sample(buffer,batchSize,Name=Value) also returns the sequence padding mask Mask, which indicates which entries of a sampled sequence contain real experiences and which entries are padding.

    Examples


    Define observation specifications for the environment. For this example, assume that the environment has a single observation channel with three continuous signals in specified ranges.

    obsInfo = rlNumericSpec([3 1],...
        LowerLimit=0,...
        UpperLimit=[1;5;10]);

    Define action specifications for the environment. For this example, assume that the environment has a single action channel with two continuous signals in specified ranges.

    actInfo = rlNumericSpec([2 1],...
        LowerLimit=0,...
        UpperLimit=[5;10]);

    Create an experience buffer with a maximum length of 20,000.

    buffer = rlReplayMemory(obsInfo,actInfo,20000);

    Append a single experience to the buffer using a structure. Each experience contains the following elements: current observation, action, next observation, reward, and is-done.

    For this example, create an experience with random observation, action, and reward values. Indicate that this experience is not a terminal condition by setting the IsDone value to 0.

    exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
    exp.Action = {actInfo.UpperLimit.*rand(2,1)};
    exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
    exp.Reward = 10*rand(1);
    exp.IsDone = 0;

    Append the experience to the buffer.

    append(buffer,exp);

    You can also append a batch of experiences to the experience buffer using a structure array. For this example, append a sequence of 100 random experiences, with the final experience representing a terminal condition.

    for i = 1:100
        expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
        expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
        expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
        expBatch(i).Reward = 10*rand(1);
        expBatch(i).IsDone = 0;
    end
    expBatch(100).IsDone = 1;
    
    append(buffer,expBatch);

    After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.

    miniBatch = sample(buffer,50);
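
    As a quick check of the returned format, you can inspect the sampled structure. The sizes noted in the comments assume the 3-by-1 observation and 2-by-1 action specifications defined above and the layout described under Output Arguments.

    size(miniBatch.Observation{1})   % 3-by-50: observation dimension by batch size
    size(miniBatch.Action{1})        % 2-by-50: action dimension by batch size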

    You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive experiences with a discount factor of 0.95.

    horizonSample = sample(buffer,1,...
        NStepHorizon=10,...
        DiscountFactor=0.95);

    The returned sample includes the following information.

    • Observation and Action are the observation and action from the first experience in the horizon.

    • NextObservation and IsDone are the next observation and termination signal from the final experience in the horizon.

    • Reward is the cumulative reward across the horizon using the specified discount factor.

    You can also sample a sequence of consecutive experiences. In this case, the structure fields contain arrays with values for all sampled experiences.

    sequenceSample = sample(buffer,1,...
        SequenceLength=20);
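
    When a sampled sequence is shorter than SequenceLength, the returned arrays are padded. To identify which steps hold real experiences, you can request the padding mask as a second output. This brief sketch uses the Mask output described under Output Arguments.

    [sequenceSample,sequenceMask] = sample(buffer,1,...
        SequenceLength=20);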

    Define observation specifications for the environment. For this example, assume that the environment has two observation channels: one channel with two continuous observations and one channel with a three-valued discrete observation.

    obsContinuous = rlNumericSpec([2 1],...
        LowerLimit=0,...
        UpperLimit=[1;5]);
    obsDiscrete = rlFiniteSetSpec([1 2 3]);
    obsInfo = [obsContinuous obsDiscrete];

    Define action specifications for the environment. For this example, assume that the environment has a single action channel with two continuous signals in specified ranges.

    actInfo = rlNumericSpec([2 1],...
        LowerLimit=0,...
        UpperLimit=[5;10]);

    Create an experience buffer with a maximum length of 5,000.

    buffer = rlReplayMemory(obsInfo,actInfo,5000);

    Append a sequence of 50 random experiences to the buffer.

    for i = 1:50
        exp(i).Observation = ...
            {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
        exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
        exp(i).NextObservation = ...
            {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
        exp(i).Reward = 10*rand(1);
        exp(i).IsDone = 0;
    end
    
    append(buffer,exp);

    After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.

    miniBatch = sample(buffer,10);
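
    Because this buffer has two observation channels, the Observation field of the sampled structure is a cell array with one element per channel. The sizes noted in the comments assume the specifications defined above and the layout described under Output Arguments.

    size(miniBatch.Observation{1})   % 2-by-10: continuous channel
    size(miniBatch.Observation{2})   % 1-by-10: discrete channel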

    Input Arguments


    Experience buffer, specified as an rlReplayMemory object.

    Batch size of experiences to sample, specified as a positive integer.

    If batchSize is greater than the current length of the buffer, then sample returns no experiences.

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: DiscountFactor=0.95

    Sequence length, specified as a positive integer. For each batch element, sample up to SequenceLength consecutive experiences. If a sampled experience has a nonzero IsDone value, stop the sequence at that experience.

    N-step horizon length, specified as a positive integer. For each batch element, sample up to NStepHorizon consecutive experiences. If a sampled experience has a nonzero IsDone value, stop the horizon at that experience. Return the following experience information based on the sampled horizon.

    • Observation and Action values from the first experience in the horizon

    • NextObservation and IsDone values from the final experience in the horizon.

    • Cumulative reward across the horizon using the specified discount factor, DiscountFactor.

    If an experience in the horizon has a nonzero IsDone value, the horizon ends at that experience, and the cumulative reward includes only the rewards collected up to and including that experience.

    Sampling an n-step horizon is not supported when sampling sequences. Therefore, if SequenceLength > 1, then NStepHorizon must be 1.
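
    As an illustration of this restriction, the following sketch contrasts a valid horizon call with a valid sequence call; combining SequenceLength > 1 with NStepHorizon > 1 is not supported.

    % Valid: N-step horizon sampling (single-step sequences).
    hSample = sample(buffer,1,NStepHorizon=10,DiscountFactor=0.95);
    % Valid: sequence sampling; NStepHorizon must remain 1.
    sSample = sample(buffer,1,SequenceLength=20);
    % Not supported: SequenceLength > 1 together with NStepHorizon > 1.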

    Discount factor, specified as a nonnegative scalar less than or equal to one. When you sample a horizon of experiences (NStepHorizon > 1), sample returns the cumulative reward R computed as follows.

    R = \sum_{i=1}^{N} \gamma^{i} R_i

    Here:

    • γ is the discount factor.

    • N is the sampled horizon length, which can be less than NStepHorizon.

    • R_i is the reward for the ith horizon step.

    DiscountFactor applies only when NStepHorizon is greater than one.
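
    As an illustrative sketch of this computation (the per-step reward values here are arbitrary and not taken from the buffer):

    % Cumulative reward over an N-step horizon with discount factor gamma,
    % following the formula above: R = sum over i of gamma^i * R_i.
    gamma = 0.95;
    r = [1.0 0.5 2.0];                 % example rewards for steps 1..N
    R = sum(gamma.^(1:numel(r)).*r);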

    Data source index, specified as one of the following (see the sketch after this list):

    • -1 — Sample from the experiences of all data sources.

    • Nonnegative integer — Sample from the experiences of only the data source specified by DataSourceID.
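
    For example, if the buffer contains experiences tagged with data source index 0 (for instance, experiences collected by one of several parallel workers), you can restrict sampling to that source. This is a minimal sketch; it assumes such experiences have already been appended.

    % Sample 32 experiences drawn only from data source 0.
    srcBatch = sample(buffer,32,DataSourceID=0);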

    Output Arguments


    Experience sampled from the buffer, returned as a structure with the following fields.

    Starting state, returned as a cell array with length equal to the number of observation specifications specified when creating the buffer. Each element of Observation contains a D_O-by-batchSize-by-SequenceLength array, where D_O is the dimension of the corresponding observation specification.

    Agent action from starting state, returned as a cell array with length equal to the number of action specifications specified when creating the buffer. Each element of Action contains a D_A-by-batchSize-by-SequenceLength array, where D_A is the dimension of the corresponding action specification.

    Reward value obtained by taking the specified action from the starting state, returned as a 1-by-1-by-SequenceLength array.

    Next state reached by taking the specified action from the starting state, returned as a cell array with the same format as Observation.

    Termination signal, returned as a 1-by-1-by-SequenceLength array of integers. Each element of IsDone has one of the following values.

    • 0 — This experience is not the end of an episode.

    • 1 — The episode terminated because the environment generated a termination signal.

    • 2 — The episode terminated by reaching the maximum episode length.

    Sequence padding mask, returned as a logical array with length equal to SequenceLength. When the sampled sequence length is less than SequenceLength, the data returned in experience is padded. Each element of Mask is true for a real experience and false for a padded experience.

    You can ignore Mask when SequenceLength is 1.
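
    For example, when sampling sequences you can use Mask to discard padded steps. This is a minimal sketch; it assumes experience and Mask were returned by a call such as [experience,Mask] = sample(buffer,1,SequenceLength=20).

    % Keep only the rewards from real (non-padded) steps of the sequence.
    realRewards = experience.Reward(:,:,Mask);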

    Version History

    Introduced in R2022a