rlEpsilonGreedyPolicy

Policy object to generate discrete epsilon-greedy actions for custom training loops

Since R2022a

    Description

    This object implements an epsilon-greedy policy: given an input observation, it returns either the action that maximizes a discrete action-space Q-value function, with probability 1-Epsilon, or a random action otherwise. You can create an rlEpsilonGreedyPolicy object from an rlQValueFunction or rlVectorQValueFunction object, or extract it from an rlQAgent, rlDQNAgent, or rlSARSAAgent. You can then train the policy object using a custom training loop or deploy it for your application. If UseEpsilonGreedyAction is set to 0 (false), the policy is deterministic and therefore does not explore. This object is not compatible with generatePolicyBlock and generatePolicyFunction. For more information on policies and value functions, see Create Policies and Value Functions.
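
    As an illustration of the selection rule only (the policy object performs this computation internally; the variables below are hypothetical):

    epsilon = 0.1;                            % exploration probability (hypothetical value)
    qValues = [0.2 0.7 0.5];                  % example Q-values for three discrete actions
    if rand < epsilon
        actionIndex = randi(numel(qValues));  % explore: choose a random action
    else
        [~,actionIndex] = max(qValues);       % exploit: choose the greedy action
    end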

    Creation

    Description

    policy = rlEpsilonGreedyPolicy(qValueFunction) creates the epsilon-greedy policy object policy from the discrete action-space Q-value function qValueFunction. It also sets the QValueFunction property of policy to the input argument qValueFunction.

    Properties

    QValueFunction

    Discrete action-space Q-value function approximator, specified as an rlQValueFunction or rlVectorQValueFunction object.

    ExplorationOptions

    Exploration options, specified as an EpsilonGreedyExploration object. Changing the noise state or any exploration option of an rlEpsilonGreedyPolicy object deployed through code generation is not supported.

    For more information, see the EpsilonGreedyExploration property in rlQAgentOptions.
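
    For example, assuming the Epsilon, EpsilonMin, and EpsilonDecay fields described for the EpsilonGreedyExploration options in rlQAgentOptions, a sketch of adjusting the exploration behavior of an existing policy object is:

    % Sketch only; the field names are assumed to match the
    % EpsilonGreedyExploration options described in rlQAgentOptions.
    policy.ExplorationOptions.Epsilon      = 0.9;    % initial exploration probability
    policy.ExplorationOptions.EpsilonMin   = 0.01;   % lower bound reached after decay
    policy.ExplorationOptions.EpsilonDecay = 0.005;  % decay rate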

    UseEpsilonGreedyAction

    Option to enable epsilon-greedy actions, specified as a logical value: either true (default, enables epsilon-greedy actions, which helps exploration) or false (disables epsilon-greedy actions). When epsilon-greedy actions are disabled, the policy is deterministic and therefore does not explore.

    Example: false

    EnableEpsilonDecay

    Option to enable epsilon decay, specified as a logical value: either true (default, enables epsilon decay) or false (disables epsilon decay).

    Example: false
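
    As a standalone sketch of the assumed decay schedule (consistent with the EpsilonGreedyExploration options described in rlQAgentOptions, in which Epsilon is multiplied by 1-EpsilonDecay at each step until it reaches EpsilonMin):

    % Sketch of the assumed epsilon decay rule; variable names are hypothetical.
    epsilon = 1;            % initial Epsilon
    epsilonMin = 0.01;      % EpsilonMin
    epsilonDecay = 0.005;   % EpsilonDecay
    for step = 1:1000
        if epsilon > epsilonMin
            epsilon = epsilon*(1 - epsilonDecay);   % decay toward EpsilonMin
        end
    end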

    ObservationInfo

    Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array of such objects. These objects define properties such as the dimensions, data types, and names of the observation channels.

    ActionInfo

    Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name.

    Note

    Only one action channel is allowed.

    SampleTime

    Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to -1 allows for event-based simulations.

    Within a Simulink® environment, the RL Agent block in which the policy is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.

    Within a MATLAB® environment, the policy is executed every time the environment advances. In this case, SampleTime is the time interval between consecutive elements in the output experience. If SampleTime is -1, the sample time is treated as being equal to 1.

    Example: 0.2

    Object Functions

    getAction - Obtain action from agent, actor, or policy object given environment observations
    getLearnableParameters - Obtain learnable parameter values from agent, function approximator, or policy object
    reset - Reset environment, agent, experience buffer, or policy object
    setLearnableParameters - Set learnable parameter values of agent, function approximator, or policy object
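
    The following sketch shows how these functions can fit together in a custom loop. It assumes a MATLAB environment object env whose observation and action specifications match those of the policy; the parameter-update computation itself is omitted.

    % Sketch only; env is assumed to exist and to match the policy specifications.
    obs = reset(env);                             % initial observation from the environment
    for stepCount = 1:100
        action = getAction(policy,{obs});         % epsilon-greedy action (returned as a cell array)
        [obs,reward,isDone] = step(env,action{1});
        if isDone
            obs = reset(env);                     % start a new episode
        end
    end
    params = getLearnableParameters(policy);      % read the critic parameters
    policy = setLearnableParameters(policy,params);  % write (for example, updated) parameters back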

    Examples

    Create observation and action specification objects. For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles, and the action space as a finite set consisting of two possible row vectors, [1 0] and [0 1].

    obsInfo = rlNumericSpec([4 1]);
    actInfo = rlFiniteSetSpec({[1 0],[0 1]});

    Alternatively, use getObservationInfo and getActionInfo to extract the specification objects from an environment.
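
    For example, using one of the predefined MATLAB environments (shown only as an illustration; its specifications differ from the ones defined above):

    env = rlPredefinedEnv("CartPole-Discrete");   % predefined environment (illustration only)
    envObsInfo = getObservationInfo(env);
    envActInfo = getActionInfo(env);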

    Create a vector Q-value function approximator to use as the critic. A vector Q-value function must accept an observation as input and return a single vector with as many elements as the number of possible discrete actions.

    To model the parametrized vector Q-value function within the critic, use a neural network. Define a single path from the network input to its output as an array of layer objects.

    layers = [ 
        featureInputLayer(prod(obsInfo.Dimension))
        fullyConnectedLayer(10)
        reluLayer
        fullyConnectedLayer(numel(actInfo.Elements)) 
        ];

    Convert the network to a dlnetwork object and display the number of learnable parameters.

    model = dlnetwork(layers);
    summary(model)
       Initialized: true
    
       Number of learnables: 72
    
       Inputs:
          1   'input'   4 features
    

    Create a vector Q-value function using model, and the observation and action specifications.

    qValueFcn = rlVectorQValueFunction(model,obsInfo,actInfo)
    qValueFcn = 
      rlVectorQValueFunction with properties:
    
        ObservationInfo: [1x1 rl.util.rlNumericSpec]
             ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
          Normalization: "none"
              UseDevice: "cpu"
             Learnables: {4x1 cell}
                  State: {0x1 cell}
    
    

    Check the critic with a random observation input.

    getValue(qValueFcn,{rand(obsInfo.Dimension)})
    ans = 2x1 single column vector
    
        0.6486
       -0.3103
    
    

    Create a policy object from qValueFcn.

    policy = rlEpsilonGreedyPolicy(qValueFcn)
    policy = 
      rlEpsilonGreedyPolicy with properties:
    
                QValueFunction: [1x1 rl.function.rlVectorQValueFunction]
            ExplorationOptions: [1x1 rl.option.EpsilonGreedyExploration]
                 Normalization: "none"
        UseEpsilonGreedyAction: 1
            EnableEpsilonDecay: 1
               ObservationInfo: [1x1 rl.util.rlNumericSpec]
                    ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
                    SampleTime: -1
    
    

    Check the policy with a random observation input.

    getAction(policy,{rand(obsInfo.Dimension)})
    ans = 1x1 cell array
        {[1 0]}
    
    

    You can now train the policy with a custom training loop and then deploy it to your application.

    Version History

    Introduced in R2022a