# rlSACAgentOptions

Options for SAC agent

## Description

Use an `rlSACAgentOptions` object to specify options for soft actor-critic (SAC) agents. To create a SAC agent, use `rlSACAgent`.

For more information, see Soft Actor-Critic Agents.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

## Creation

### Description

`opt = rlSACAgentOptions` creates an options object for use as an argument when creating a SAC agent using all default options. You can modify the object properties using dot notation.

`opt = rlSACAgentOptions(Name,Value)` sets option properties using name-value pairs. For example, `rlSACAgentOptions('DiscountFactor',0.95)` creates an option set with a discount factor of `0.95`. You can specify multiple name-value pairs. Enclose each property name in quotes.
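As a brief illustration of the two syntaxes, the following sketch sets two properties at creation with name-value pairs and then changes a third using dot notation. The numeric values are arbitrary choices for illustration, not recommended settings.

```matlab
% Set several options at creation using name-value pairs.
% The values below are illustrative assumptions, not tuned settings.
opt = rlSACAgentOptions('DiscountFactor',0.95,'MiniBatchSize',128);

% Modify another property afterward using dot notation.
opt.SampleTime = 0.5;
```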

## Properties

`EntropyWeightOptions`

— Entropy tuning options

`EntropyWeightOptions` object

Entropy tuning options, specified as an `EntropyWeightOptions` object with the following properties.

`EntropyWeight`

— Initial entropy component weight

`1` (default) | positive scalar

Initial entropy component weight, specified as a positive scalar.

`LearnRate`

— Optimizer learning rate

`3e-4` (default) | nonnegative scalar

Optimizer learning rate, specified as a nonnegative scalar. If `LearnRate` is zero, the `EntropyWeight` value is fixed during training and the `TargetEntropy` value is ignored.
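For instance, to train with a constant entropy weight, you could set the learning rate to zero as described above. This is a minimal sketch; the weight of `0.2` is an assumed value chosen only for illustration.

```matlab
opt = rlSACAgentOptions;

% Keep the entropy weight fixed at an assumed value of 0.2.
opt.EntropyWeightOptions.EntropyWeight = 0.2;

% A zero learning rate disables entropy weight tuning,
% so TargetEntropy is ignored during training.
opt.EntropyWeightOptions.LearnRate = 0;
```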

`TargetEntropy`

— Target entropy value

`[]` (default) | scalar

Target entropy value for tuning entropy weight, specified as a scalar. A higher target entropy value encourages more exploration.

If you do not specify `TargetEntropy`, the agent uses -*A* as the target value, where *A* is the number of actions.

`Algorithm`

— Algorithm to tune entropy

`"adam"` (default) | `"sgdm"` | `"rmsprop"`

Algorithm to tune entropy, specified as one of the following strings.

- `"adam"`: Use the Adam optimizer. You can specify the decay rates of the gradient and squared gradient moving averages using the `GradientDecayFactor` and `SquaredGradientDecayFactor` fields of the `OptimizerParameters` option.
- `"sgdm"`: Use the stochastic gradient descent with momentum (SGDM) optimizer. You can specify the momentum value using the `Momentum` field of the `OptimizerParameters` option.
- `"rmsprop"`: Use the RMSProp optimizer. You can specify the decay rate of the squared gradient moving average using the `SquaredGradientDecayFactor` field of the `OptimizerParameters` option.

For more information about these optimizers, see Stochastic Gradient Descent in Deep Learning Toolbox™.
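The following sketch shows one way to switch the entropy weight optimizer to SGDM and adjust its momentum. The momentum value of `0.8` is an assumption for illustration, and the sketch assumes that `Momentum` can be set once `Algorithm` is `"sgdm"`, as the description above suggests.

```matlab
opt = rlSACAgentOptions;

% Use SGDM instead of the default Adam optimizer for entropy tuning.
opt.EntropyWeightOptions.Algorithm = "sgdm";

% Momentum applies to the SGDM optimizer; 0.8 is an illustrative value.
opt.EntropyWeightOptions.OptimizerParameters.Momentum = 0.8;
```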

`GradientThreshold`

— Threshold value for gradient

`Inf` (default) | positive scalar

Threshold value for the entropy gradient, specified as `Inf` or a positive scalar. If the gradient exceeds this value, the gradient is clipped.

`OptimizerParameters`

— Applicable parameters for optimizer

`OptimizerParameters` object

Applicable parameters for the optimizer, specified as an `OptimizerParameters` object with the following parameters. The default parameter values work well for most problems.

Parameter | Description | Default |
---|---|---|
`Momentum` | Contribution of previous step, specified as a scalar from 0 to 1. A value of 0 means no contribution from the previous step. A value of 1 means maximal contribution. This parameter applies only when `Algorithm` is `"sgdm"`. | `0.9` |
`Epsilon` | Denominator offset, specified as a positive scalar. The optimizer adds this offset to the denominator in the network parameter updates to avoid division by zero. This parameter applies only when `Algorithm` is `"adam"` or `"rmsprop"`. | `1e-8` |
`GradientDecayFactor` | Decay rate of gradient moving average, specified as a positive scalar from 0 to 1. This parameter applies only when `Algorithm` is `"adam"`. | `0.9` |
`SquaredGradientDecayFactor` | Decay rate of squared gradient moving average, specified as a positive scalar from 0 to 1. This parameter applies only when `Algorithm` is `"adam"` or `"rmsprop"`. | `0.999` |

When a particular property of `OptimizerParameters` is not applicable to the optimizer type specified in the `Algorithm` option, that property is set to `"Not applicable"`.

To change the default values, access the properties of `OptimizerParameters` using dot notation.

    opt = rlSACAgentOptions;
    opt.EntropyWeightOptions.OptimizerParameters.GradientDecayFactor = 0.95;

`PolicyUpdateFrequency`

— Number of steps between actor policy updates

`1` (default) | positive integer

Number of steps between actor policy updates, specified as a positive integer. For more information, see Training Algorithm.

`CriticUpdateFrequency`

— Number of steps between critic updates

`1` (default) | positive integer

Number of steps between critic updates, specified as a positive integer. For more information, see Training Algorithm.

`NumWarmStartSteps`

— Number of actions to take before updating actor and critic

positive integer

Number of actions to take before updating actor and critics, specified as a positive integer. By default, the `NumWarmStartSteps` value is equal to the `MiniBatchSize` value.

`NumGradientStepsPerUpdate`

— Number of gradient steps when updating actor and critics

`1` (default) | positive integer

Number of gradient steps to take when updating actor and critics, specified as a positive integer.

`ActorOptimizerOptions`

— Actor optimizer options

`rlOptimizerOptions` object

Actor optimizer options, specified as an `rlOptimizerOptions` object. Use this object to specify training parameters of the actor approximator, such as the learning rate and gradient threshold, as well as the optimizer algorithm and its parameters. For more information, see `rlOptimizerOptions` and `rlOptimizer`.

`CriticOptimizerOptions`

— Critic optimizer options

`rlOptimizerOptions` object

Critic optimizer options, specified as an `rlOptimizerOptions` object. Use this object to specify training parameters of the critic approximator, such as the learning rate and gradient threshold, as well as the optimizer algorithm and its parameters. For more information, see `rlOptimizerOptions` and `rlOptimizer`.
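As a sketch of how the actor and critic optimizer options might be adjusted, the following code sets the learning rate and gradient threshold for both. The numeric values are assumptions for illustration. Because the default options object displays `CriticOptimizerOptions` as a 1-by-2 array, the sketch loops over its elements rather than assuming a single object.

```matlab
opt = rlSACAgentOptions;

% Actor optimizer: illustrative learning rate and gradient threshold.
opt.ActorOptimizerOptions.LearnRate = 1e-3;
opt.ActorOptimizerOptions.GradientThreshold = 1;

% CriticOptimizerOptions can hold one options object per critic,
% so apply the same illustrative settings to each element.
for k = 1:numel(opt.CriticOptimizerOptions)
    opt.CriticOptimizerOptions(k).LearnRate = 1e-3;
    opt.CriticOptimizerOptions(k).GradientThreshold = 1;
end
```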

`TargetSmoothFactor`

— Smoothing factor for target critic updates

`1e-3` (default) | positive scalar less than or equal to 1

Smoothing factor for target critic updates, specified as a positive scalar less than or equal to 1. For more information, see Target Update Methods.

`TargetUpdateFrequency`

— Number of steps between target critic updates

`1` (default) | positive integer

Number of steps between target critic updates, specified as a positive integer. For more information, see Target Update Methods.
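`TargetSmoothFactor` and `TargetUpdateFrequency` together determine how the target critics track the critics. The sketch below contrasts two possible configurations. The labels "smoothing" and "periodic" reflect the usual description of target update methods and are not defined on this page, and the frequency of `10` is an assumed value.

```matlab
% Smoothing: update the target critics at every step with a small factor
% (these are the default values).
optSmooth = rlSACAgentOptions;
optSmooth.TargetUpdateFrequency = 1;
optSmooth.TargetSmoothFactor = 1e-3;

% Periodic: update the target critics every 10 steps (assumed value)
% with a smoothing factor of 1, that is, without smoothing.
optPeriodic = rlSACAgentOptions;
optPeriodic.TargetUpdateFrequency = 10;
optPeriodic.TargetSmoothFactor = 1;
```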

`ResetExperienceBufferBeforeTraining`

— Option for clearing the experience buffer

`true` (default) | `false`

Option for clearing the experience buffer before training, specified as a logical value.

`SequenceLength`

— Maximum batch-training trajectory length when using RNN

`1` (default) | positive integer

Maximum batch-training trajectory length when using a recurrent neural network, specified as a positive integer. This value must be greater than `1` when using a recurrent neural network and `1` otherwise.

`MiniBatchSize`

— Size of random experience mini-batch

`64` (default) | positive integer

Size of random experience mini-batch, specified as a positive integer. During each training episode, the agent randomly samples experiences from the experience buffer when computing gradients for updating the actor and critics. Large mini-batches reduce the variance when computing gradients but increase the computational effort.

`NumStepsToLookAhead`

— Number of future rewards used to estimate the value of the policy

`1` (default) | positive integer

Number of future rewards used to estimate the value of the policy, specified as a positive integer. For more information, see [1], Chapter 7.

If parallel training is enabled (that is, if an `rlTrainingOptions` object in which the `UseParallel` property is set to `true` is passed to `train`), then `NumStepsToLookAhead` must be set to `1`; otherwise, an error is generated. This guarantees that experiences are stored contiguously.
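A minimal sketch of a configuration that satisfies this constraint is shown below. The variable names `agentOpts` and `trainOpts` are illustrative, and running parallel training itself would additionally require a parallel computing setup, which this sketch does not cover.

```matlab
% Agent options: keep the look-ahead at 1 for parallel training.
agentOpts = rlSACAgentOptions;
agentOpts.NumStepsToLookAhead = 1;

% Training options with parallel training enabled.
trainOpts = rlTrainingOptions('UseParallel',true);
```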

`ExperienceBufferLength`

— Experience buffer size

`10000` (default) | positive integer

Experience buffer size, specified as a positive integer. During training, the agent computes updates using a mini-batch of experiences randomly sampled from the buffer.
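The buffer length, mini-batch size, and warm start steps are often adjusted together. The following sketch uses assumed values purely for illustration; suitable settings depend on the environment and available memory.

```matlab
opt = rlSACAgentOptions;

% Store more experiences than the default of 10000 (assumed value).
opt.ExperienceBufferLength = 1e6;

% Sample larger mini-batches to reduce gradient variance,
% at the cost of more computation per update (assumed value).
opt.MiniBatchSize = 256;

% Collect 1000 actions (assumed value) before the first
% actor and critic update.
opt.NumWarmStartSteps = 1000;
```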

`SampleTime`

— Sample time of agent

`1` (default) | positive scalar | `-1`

Sample time of agent, specified as a positive scalar or as `-1`. Setting this parameter to `-1` allows for event-based simulations.

Within a Simulink® environment, the RL Agent block in which the agent is specified executes every `SampleTime` seconds of simulation time. If `SampleTime` is `-1`, the block inherits the sample time from its parent subsystem.

Within a MATLAB® environment, the agent is executed every time the environment advances. In this case, `SampleTime` is the time interval between consecutive elements in the output experience returned by `sim` or `train`. If `SampleTime` is `-1`, the time interval between consecutive elements in the returned output experience reflects the timing of the event that triggers the agent execution.

`DiscountFactor`

— Discount factor

`0.99` (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or equal to 1.

## Object Functions

`rlSACAgent` | Soft actor-critic reinforcement learning agent |

## Examples

### Create SAC Agent Options Object

Create a SAC agent options object, specifying the discount factor.

`opt = rlSACAgentOptions('DiscountFactor',0.95)`

    opt = 
      rlSACAgentOptions with properties:

                       EntropyWeightOptions: [1x1 rl.option.EntropyWeightOptions]
                      PolicyUpdateFrequency: 1
                      CriticUpdateFrequency: 1
                          NumWarmStartSteps: 64
                  NumGradientStepsPerUpdate: 1
                      ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
                     CriticOptimizerOptions: [1x2 rl.option.rlOptimizerOptions]
                         TargetSmoothFactor: 1.0000e-03
                      TargetUpdateFrequency: 1
        ResetExperienceBufferBeforeTraining: 1
                             SequenceLength: 1
                              MiniBatchSize: 64
                        NumStepsToLookAhead: 1
                     ExperienceBufferLength: 10000
                                 SampleTime: 1
                             DiscountFactor: 0.9500
                                 InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent sample time to `0.5`.

opt.SampleTime = 0.5;

For SAC agents, configure the entropy weight optimizer using the options in `EntropyWeightOptions`. For example, set the target entropy value to `-5`.

opt.EntropyWeightOptions.TargetEntropy = -5;
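An options object is typically passed to `rlSACAgent` when creating the agent. The following sketch assumes hypothetical observation and action specifications (a 4-element observation and a 2-element bounded continuous action) and relies on `rlSACAgent` building a default actor and critics from those specifications. It also sets the target entropy explicitly to the negative of the number of actions, which matches the documented default.

```matlab
% Hypothetical specifications, chosen only for illustration.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1],'LowerLimit',-1,'UpperLimit',1);

% Agent options with a custom discount factor.
opt = rlSACAgentOptions('DiscountFactor',0.95);

% Explicitly use -A as the target entropy, where A is the number of actions.
opt.EntropyWeightOptions.TargetEntropy = -prod(actInfo.Dimension);

% Create a SAC agent with default networks and the options above.
agent = rlSACAgent(obsInfo,actInfo,opt);
```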

## References

[1] Sutton, Richard S., and Andrew G.
Barto. *Reinforcement Learning: An Introduction*. Second edition.
Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press, 2018.

## Version History

**Introduced in R2020b**

### R2022a: Simulation and deployment: `UseDeterministicExploitation` will be removed

The property `UseDeterministicExploitation` of the `rlSACAgentOptions` object will be removed in a future release. Use the `UseExplorationPolicy` property of `rlSACAgent` instead.

Previously, you set `UseDeterministicExploitation` as follows.

Force the agent to always select the action with maximum likelihood, thereby using a greedy deterministic policy for simulation and deployment.

agent.AgentOptions.UseDeterministicExploitation = true;

Allow the agent to select its action by sampling its probability distribution for simulation and policy deployment, thereby using a stochastic policy that explores the observation space.

agent.AgentOptions.UseDeterministicExploitation = false;

Starting in R2022a, set `UseExplorationPolicy` as follows.

Force the agent to always select the action with maximum likelihood, thereby using a greedy deterministic policy for simulation and deployment.

agent.UseExplorationPolicy = false;

Allow the agent to select its action by sampling its probability distribution for simulation and policy deployment, thereby using a stochastic policy that explores the observation space.

agent.UseExplorationPolicy = true;

Similarly to `UseDeterministicExploitation`, `UseExplorationPolicy` affects only simulation and deployment; it does not affect training.

