generateHindsightExperiences
Generate hindsight experiences from hindsight experience replay buffer
Since R2023a
Description
experience = generateHindsightExperiences(buffer,trajectoryLength) generates hindsight experiences from the last trajectory added to the specified hindsight experience replay memory buffer.
Examples
Generate Experiences from Hindsight Replay Memory
When you use a hindsight replay memory buffer within your custom agent training loop, you generate experiences at the end of each training episode.
Create an observation specification for an environment with a single observation channel containing six observations. For this example, assume that the observation channel contains six signals, where:
Elements 4 and 5 are the goal observations.
Elements 2 and 3 are the goal measurements.
Elements 1 and 6 are additional observations.
obsInfo = rlNumericSpec([6 1],...
LowerLimit=0,UpperLimit=[1;5;5;5;5;1]);
Create a specification for a single action.
actInfo = rlNumericSpec([1 1],...
LowerLimit=0,UpperLimit=10);
To create a hindsight replay memory buffer, first define the goal condition information. Both the goals and goal measurements are in the single observation channel. The goal measurements are in elements 2 and 3 of the observation channel and the goals are in elements 4 and 5 of the observation channel.
goalConditionInfo = {{1,[2 3],1,[4 5]}};
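Here, the goal condition is a cell array of the form {measurement channel, measurement element indices, goal channel, goal element indices}. For comparison, a hypothetical environment with its goal measurements in a second observation channel (not the case in this example) could use a definition like the following commented-out line.
% Hypothetical alternative (not used in this example): goal measurements in
% elements 1 and 2 of observation channel 2, goals in elements 4 and 5 of
% observation channel 1.
% goalConditionInfo = {{2,[1 2],1,[4 5]}};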
For this example, use hindsightRewardFcn1 as the ground-truth reward function and hindsightIsDoneFcn1 as the termination condition function.
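These helper functions are not listed in this section. The following is a minimal sketch of what such helpers might look like, assuming they receive the observation, action, next observation, and goal as inputs, that the observations are cell arrays as in this example, and that the goal counts as reached when the goal measurement is within a small tolerance of the goal. Check the rlHindsightReplayMemory documentation for the exact required signatures, and define the functions in their own files or at the end of the example script.
function reward = hindsightRewardFcn1(obs,action,nextObs,goal)
% Sketch only: return a reward of 1 when the goal measurement (elements 2
% and 3 of the next observation) is close to the goal, and 0 otherwise.
goalMeasurement = nextObs{1}(2:3);
reward = double(norm(goalMeasurement - goal) < 0.1);
end

function isDone = hindsightIsDoneFcn1(obs,action,nextObs,goal)
% Sketch only: terminate the episode when the goal is reached.
goalMeasurement = nextObs{1}(2:3);
isDone = double(norm(goalMeasurement - goal) < 0.1);
end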
Create the hindsight replay memory buffer.
buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
@hindsightRewardFcn1,@hindsightIsDoneFcn1,goalConditionInfo);
As you train your agent, you add experience trajectories to the experience buffer. For this example, add a random experience trajectory of length 10.
for i = 1:10
    exp(i).Observation = {obsInfo.UpperLimit.*rand(6,1)};
    exp(i).Action = {actInfo.UpperLimit.*rand(1)};
    exp(i).NextObservation = {obsInfo.UpperLimit.*rand(6,1)};
    exp(i).Reward = 10*rand(1);
    exp(i).IsDone = 0;
end
exp(10).IsDone = 1;
append(buffer,exp);
At the end of the training episode, you generate hindsight experiences from the last trajectory added to the buffer. Generate experiences specifying the length of the last trajectory added to the buffer.
newExp = generateHindsightExperiences(buffer,10);
For each experience in the final trajectory, the default "final" sampling strategy generates a new experience in which it replaces the goals in Observation and NextObservation with the goal measurements from the final experience in the trajectory.
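Conceptually, using the trajectory created above, the substitution for one generated experience looks like the following sketch (this is an illustration, not the internal implementation).
% Illustration only: copy the final goal measurements (elements 2 and 3 of
% the last next observation) into the goal elements (4 and 5) of an
% earlier observation.
finalGoalMeasurement = exp(10).NextObservation{1}(2:3);
hindsightObs = exp(6).Observation{1};
hindsightObs(4:5) = finalGoalMeasurement;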
To validate this behavior, first view the final goal measurements from exp.
exp(10).NextObservation{1}(2:3)
ans = 2×1
0.7277
0.6803
Next, view the goal values for one of the generated experiences. These values should match the final goal measurements.
newExp(6).Observation{1}(4:5)
ans = 2×1
0.7277
0.6803
After generating the new experiences, append them to the buffer.
append(buffer,newExp);
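With both the original and hindsight experiences stored, you can then sample mini-batches from the buffer to update your agent, for example as follows (the mini-batch size here is arbitrary for this illustration).
% Sample a mini-batch of experiences (original and hindsight) for a
% learning update. The mini-batch size of 8 is arbitrary.
miniBatch = sample(buffer,8);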
Input Arguments
buffer — Hindsight experience buffer
rlHindsightReplayMemory object | rlHindsightPrioritizedReplayMemory object
Hindsight experience buffer, specified as an rlHindsightReplayMemory or rlHindsightPrioritizedReplayMemory object.
trajectoryLength — Length of last trajectory in buffer
positive integer
Length of the last trajectory added to the buffer, specified as a positive integer.
Output Arguments
experience — Hindsight experiences generated from the buffer
structure
Hindsight experiences generated from the buffer, returned as a structure with the following fields.
Observation — Observation
cell array
Observation, returned as a cell array with length equal to the number of observation specifications specified when creating the buffer. Each element of Observation contains a DO-by-batchSize-by-SequenceLength array, where DO is the dimension of the corresponding observation specification.
Action — Agent action
cell array
Agent action, returned as a cell array with length equal to the number of action specifications specified when creating the buffer. Each element of Action contains a DA-by-batchSize-by-SequenceLength array, where DA is the dimension of the corresponding action specification.
Reward — Reward value
scalar | array
Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-SequenceLength array.
NextObservation — Next observation
cell array
Next observation reached by taking the specified action from the observation, returned as a cell array with the same format as Observation.
IsDone — Termination signal
integer | array
Termination signal, returned as a 1-by-1-by-SequenceLength array of integers. Each element of IsDone has one of the following values.
0 — This experience is not the end of an episode.
1 — The episode terminated because the environment generated a termination signal.
2 — The episode terminated by reaching the maximum episode length.
Version History
Introduced in R2023a