**Multiperiod Goal-Based Wealth Management Using Reinforcement Learning**

This example shows a reinforcement learning (RL) approach to maximize the probability of obtaining an investor's wealth goal at the end of the investment horizon. This problem is known in the literature as goal-based wealth management (GBWM). In GBWM, risk is not necessarily measured using the standard deviation, the value-at-risk, or any other common risk metric. Instead, risk is understood as the likelihood of not attaining an investor's goal. This alternative concept of risk implies that, sometimes, in order to increase the probability of attaining an investor’s goal, the optimal portfolio’s traditional risk (that is, standard deviation) must increase if the portfolio is underfunded. In other words, for the investor’s view of risk to *decrease*, the traditional view of risk must *increase* if the portfolio’s wealth is too low.

The purpose of the investment strategy is to determine a dynamic portfolio allocation that maximizes the probability of achieving a wealth goal *G* at the time horizon *T*. The dynamic allocation strategy is the optimal solution of the following multiperiod portfolio optimization problem:

$\underset{\left\{\mathit{A}\left(0\right),\mathit{A}\left(1\right),\dots ,\mathit{A}\left(\mathit{T}-1\right)\right\}}{\mathrm{max}}\;\mathbb{P}\left(\mathit{W}\left(\mathit{T}\right)\ge \mathit{G}\right)$,

where $\mathit{W}\left(\mathit{T}\right)$ is the terminal portfolio wealth and $\mathit{A}\left(\mathit{t}\right)$ are the possible actions and allocations at time $\mathit{t}=0,1,\dots ,\mathit{T}-1$.

To solve this problem, this example follows the GBWM strategy of Das and Varma [1], which uses RL to optimize the probability of attaining an investment goal. The goal of RL is to train an agent to complete a task within an unknown environment. The agent receives observations and a reward from the environment and sends actions to the environment. The reward is a measure of how successful an action is with respect to completing the task goal.

The agent contains two components: a policy and a learning algorithm.

The policy is a mapping that selects actions based on the observations from the environment. Typically, the policy is a function approximator with tunable parameters, such as a deep neural network.

The learning algorithm continuously updates the policy parameters based on the actions, observations, and reward. The goal of the learning algorithm is to find an optimal policy that maximizes the cumulative reward received during the task.

In other words, reinforcement learning involves an agent learning the optimal behavior through repeated trial-and-error interactions with the environment without human involvement. For more information on reinforcement learning, see What Is Reinforcement Learning? (Reinforcement Learning Toolbox).

The main advantage of leveraging reinforcement learning is that you can use an unknown environment. That is, in theory, you do not need to make assumptions about the state transition probabilities; you do not need to define the probability that the investment achieves a certain wealth level in a given time period. However, reinforcement learning assumes that there is an *accurate* way to simulate the transition from one state to the next for a given time period.

### Problem Definition

`rng('default')`

Specify the initial wealth.

initialWealth = 100;

Specify the target wealth at the end of the investment horizon.

targetWealth = 200;

Specify a time horizon.

finalTimePeriod = 10;

Define the mean and covariance of the annual returns.

```
returnMean = [0.05; 0.1; 0.25];
returnCovariance = [0.0025 0 0; 0 0.04 0.02; 0 0.02 0.25];
```

To solve the problem using RL, you need to define:

- Actions: Portfolio weights for the investment portfolio.
- Observations: Time period $\mathit{t}$ and wealth at time period $\mathit{t}$.
- Environment: Model of the evolution of the problem. The environment simulates the next observation after receiving an action and computes its associated reward.
- Reward: Part of the environment that measures how well the received action contributes to achieving the task.
- Agent: Component trained to complete the task within the environment. The agent is responsible for choosing the actions required to complete the task.

### Define Actions

The action at each rebalancing period is to choose the weights of the investment portfolio for the next time period. In this example, assume that the possible portfolio weights are those of portfolios on the efficient frontier. Also, you can assume that the choice of possible portfolios is fixed throughout the full investment period and that they represent a finite subset of the efficient portfolios. By avoiding working with a set of continuous actions in this way, you simplify the training and easily account for the constraints on the portfolio weights (for example, that they sum to `1`). However, you do have to assume that the distribution of the returns is time-homogeneous; that is, the underlying mean and covariance of the returns are the same at every time period.

Start by creating a `Portfolio` (Financial Toolbox) object with the asset information from the Problem Definition.

p = Portfolio(AssetMean=returnMean,AssetCovar=returnCovariance);

Add constraints to the portfolio problem. Use `setBounds` (Financial Toolbox) to bound the portfolio weights. Individual assets cannot represent more than 50% of the portfolio wealth and shorting is not allowed.

p = setBounds(p,0,0.5);

Use `setBudget` (Financial Toolbox) to specify that the portfolio must be fully invested.

p = setBudget(p,1,1);

Compute the weights of 15 portfolios on the efficient frontier using `estimateFrontier` (Financial Toolbox). These portfolios represent the possible actions at each rebalancing period.

```
numPortfolios = 15;
pwgt = estimateFrontier(p,numPortfolios);
```
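As a rough illustration of what each candidate action implies, the following sketch computes the expected return and standard deviation of a few portfolios directly from the asset moments. The matrix `exampleWeights` is hypothetical and chosen only to satisfy the bounds and budget constraints; it is not the output of `estimateFrontier`, and the variable names `exampleReturn` and `exampleRisk` are introduced here for illustration.

```matlab
% Asset moments from the Problem Definition section
returnMean = [0.05; 0.1; 0.25];
returnCovariance = [0.0025 0 0; 0 0.04 0.02; 0 0.02 0.25];

% Hypothetical candidate weights, one portfolio per column after the
% transpose. These are illustrative only and are NOT frontier portfolios.
exampleWeights = [0.50 0.50 0.00;
                  0.50 0.25 0.25;
                  0.00 0.50 0.50]';

% Expected return w'*m and standard deviation sqrt(w'*C*w) per portfolio
exampleReturn = exampleWeights' * returnMean;
exampleRisk = sqrt(diag(exampleWeights' * returnCovariance * exampleWeights));

% One row per portfolio: [expected return, standard deviation]
disp([exampleReturn exampleRisk])
```

In this sketch the more concentrated positions in the third asset raise both the expected return and the standard deviation, which is the trade-off the agent later exploits when it picks among the frontier portfolios.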

Plot the efficient frontier using `plotFrontier` (Financial Toolbox).

```
figure
[prsk,pret] = plotFrontier(p,pwgt);
hold on
scatter(prsk,pret,"red","o","LineWidth",2)
hold off
```

Use `rlFiniteSetSpec` (Reinforcement Learning Toolbox) to create the discrete action space for the reinforcement learning environment.

actionInfo = rlFiniteSetSpec(1:15);

### Define Observations

The observations are defined as elements of the environment that describe important aspects of the current state. In this example, the observations consist of the current time period $\mathit{t}$ and the wealth at that time period ${\mathit{W}}_{\mathit{t}}$. Use `rlNumericSpec` (Reinforcement Learning Toolbox) to create the observation space, a $\left(2\times 1\right)$ vector containing two signals: the wealth level and the time period.

```
% Observation space:
% [wealth; timePeriod]
observationInfo = rlNumericSpec([2 1]);
```

### Create Environment Model

You create the environment model using the following steps:

- Define a `myResetFunction` function that describes the initial conditions of the environment at the beginning of a training episode.
- Define a custom `myStepFunction` function that describes the dynamics of the environment. This includes how the state changes from a current state given the agent action and how the earned reward is computed by such an action. For the reward, here you can choose either a *sparse reward* function or a *constant return reward* function. At each training time step, the state of the model is updated using the `myStepFunction` function. This function also determines when a training episode finishes.

For more information, see Create Custom Environment Using Step and Reset Functions (Reinforcement Learning Toolbox).

#### Define Reset Function

The `myResetFunction` function (see Local Functions) sets the wealth level to its initial state, `initialWealth`, and the time period to `0`.

resetFcn = @() myResetFunction(initialWealth);

#### State Change Model

The custom `myStepFunction` (see Local Functions) simulates the wealth evolution from the current time period to the next given the portfolio weights associated with the action selected by the agent. For simplicity, this example uses a geometric Brownian motion to simulate the wealth evolution. So, given a wealth level at time step *t*, the wealth level at time *t*+1 is:

$${\mathit{W}}_{\mathit{t}+1}={\mathit{W}}_{\mathit{t}}\,\mathrm{exp}\left\{\left({\mu}_{\mathit{i}}-\frac{{\sigma}_{\mathit{i}}^{2}}{2}\right)+{\sigma}_{\mathit{i}}\mathit{Z}\right\},$$

where $\mathit{Z}$ is a standard normal random variable and ${\mu}_{\mathit{i}}$ and ${\sigma}_{\mathit{i}}$ are the mean and standard deviation of the portfolio associated with the $\mathit{i}$th action.
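To make the update rule concrete, here is a minimal sketch that holds one portfolio for the whole horizon, applies the one-period update repeatedly, and compares the simulated success probability with the closed-form lognormal value. The values of `mu` and `sigma` are assumed for illustration and do not correspond to any particular frontier portfolio; `numPaths`, `mcProb`, and `exactProb` are names introduced here.

```matlab
rng('default')                      % reproducible draws

% Hypothetical single portfolio held for the whole horizon (assumed values)
mu = 0.10;                          % mean return per period
sigma = 0.20;                       % standard deviation per period
W0 = 100; G = 200; T = 10;          % initial wealth, goal, horizon

% Apply W(t+1) = W(t)*exp((mu - sigma^2/2) + sigma*Z) for T periods
numPaths = 1e5;
Z = randn(T,numPaths);
logGrowth = sum((mu - sigma^2/2) + sigma*Z,1);
WT = W0*exp(logGrowth);
mcProb = mean(WT >= G);

% Closed form: log(W_T/W0) is normal with mean (mu - sigma^2/2)*T and
% standard deviation sigma*sqrt(T)
d = (log(G/W0) - (mu - sigma^2/2)*T)/(sigma*sqrt(T));
exactProb = 0.5*erfc(d/sqrt(2));    % equals 1 - Phi(d)

fprintf('Monte Carlo: %.4f, closed form: %.4f\n',mcProb,exactProb)
```

A fixed allocation like this one gives a baseline; the agent in this example can do better because it switches portfolios as the wealth and the remaining time change.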

#### Sparse Reward Function

Since the goal of the problem is to achieve the wealth target by the end of the investment period, you can use a reward function that awards a value of `1` if the goal is achieved at the end of the investment period and `0` otherwise. See Local Functions for the `sparseReward` function.

$$\mathit{R}\left(\mathit{W},\mathit{t}\right)=\begin{cases}0& \text{if }\mathit{t}<\mathit{T}\text{ or }\mathit{W}<\mathit{G}\\ 1& \text{if }\mathit{t}=\mathit{T}\text{ and }\mathit{W}\ge \mathit{G}\end{cases}$$

```
% Sparse reward function handle
sparseRewardFcnHandle = @(loggedSignals) sparseReward(loggedSignals, ...
    finalTimePeriod,targetWealth);
```

Agents defined with a sparse reward are challenging to train because a large number of states do not return any signal. This issue becomes more apparent as the episodes become longer.
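A short derivation, offered as a sketch, shows why the sparse reward lines up with the GBWM objective when rewards are not discounted (this example later sets the agent's discount factor to `1`). Because the only nonzero reward is the terminal indicator, the expected cumulative reward of an episode is the probability of reaching the goal:

$$\mathbb{E}\left[\sum_{\mathit{t}=1}^{\mathit{T}}\mathit{R}\left({\mathit{W}}_{\mathit{t}},\mathit{t}\right)\right]=\mathbb{E}\left[\mathbf{1}\left\{{\mathit{W}}_{\mathit{T}}\ge \mathit{G}\right\}\right]=\mathbb{P}\left({\mathit{W}}_{\mathit{T}}\ge \mathit{G}\right)$$

So a policy that maximizes the expected undiscounted return under this reward also maximizes the success probability.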

#### Constant Return Reward Function

To reduce the training time, you can also use a reward function that tries to lead the agent to achieve a minimum wealth level at each investment period. The `constantReturnReward` function (see Local Functions) assumes that the returns are the same throughout the entire investment horizon and that the return value satisfies the following inequality:

$${\mathit{W}}_{0}\,{\left(1+\mathit{r}\right)}^{\mathit{T}}\ge \mathit{G}.$$

```
% Compute r, the constant return needed to satisfy the goal by the
% end of the investment horizon
constantReturn = ...
    nthroot(targetWealth/initialWealth,finalTimePeriod) - 1;
```

When the wealth level ${\mathit{W}}_{\mathit{t}}$ is greater than ${\mathit{W}}_{0}\,{\left(1+\mathit{r}\right)}^{\mathit{t}}$, you give the agent a small reward (in this example, select a reward of `0.1`). Finally, you still give a reward of `1` if the agent reaches the goal at the end of the investment period and `0` otherwise.
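For the numbers in this example, the constant return and the per-period wealth benchmark it implies can be worked out directly. The sketch below repeats the problem constants so that it runs on its own; `benchmarkWealth` is a name introduced here for illustration.

```matlab
initialWealth = 100; targetWealth = 200; finalTimePeriod = 10;

% Constant per-period return that exactly reaches the goal at the horizon
constantReturn = nthroot(targetWealth/initialWealth,finalTimePeriod) - 1;

% Benchmark wealth W0*(1+r)^t for t = 0,...,T
t = 0:finalTimePeriod;
benchmarkWealth = initialWealth*(1 + constantReturn).^t;

disp(constantReturn)        % about 0.0718, roughly 7.18% per period
disp(benchmarkWealth(end))  % equals the target wealth up to rounding
```

Doubling the wealth in ten periods therefore requires a little over 7% per period, and the intermediate reward of `0.1` is earned whenever the simulated wealth is at or above the corresponding entry of `benchmarkWealth`.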

Define the constant return reward function handle.

```
% Reward function handle
constantReturnRewardFcnHandle = @(loggedSignals) constantReturnReward(loggedSignals,...
    initialWealth,finalTimePeriod,targetWealth,constantReturn);
```

#### Episode Termination

The custom step function flags that the episode has finished when the current time period reaches `finalTimePeriod`.

#### Define Custom Step Function

Select the reward function, then define the step function handle.

```
% Select a reward function
rewardFcnHandle = sparseRewardFcnHandle;

% Define step function handle
stepFcn = @(action,loggedSignals) myStepFunction(action,...
    loggedSignals,prsk,pret,finalTimePeriod,rewardFcnHandle);
```

#### Define Environment

Use `rlFunctionEnv` (Reinforcement Learning Toolbox) to construct the custom environment using the defined observations, actions, and reset and step functions.

```
gbwmEnvironment = rlFunctionEnv(observationInfo,actionInfo,stepFcn,...
resetFcn);
```

### Define Agent

The purpose of the agent is to select actions that are sent to the environment. The agent then receives new observations from the environment and the reward generated by the submitted actions. The goal of RL is to train the agent to select the best possible actions to maximize the probability of attaining the wealth goal by the end of the investment period.

Reinforcement Learning Toolbox™ provides several built-in agents that you can train in environments with either continuous or discrete observation and action spaces. The table in the section "Built-In Agents" of Reinforcement Learning Agents (Reinforcement Learning Toolbox) summarizes the types of agents for the different types of observation and action spaces.

Given that the action space in this example is discrete and the observation space is continuous, this example uses a deep Q-network (DQN) agent. For more information, see Deep Q-Network (DQN) Agents (Reinforcement Learning Toolbox).

Use `rlAgentInitializationOptions` (Reinforcement Learning Toolbox) to specify the number of neurons in each learnable layer.

```
initializationOptions =...
    rlAgentInitializationOptions('NumHiddenUnit',30);
```

Use `rlDQNAgent` (Reinforcement Learning Toolbox) to create the agent using the defined action and observation specifications.

```
DQNagent = rlDQNAgent(observationInfo,actionInfo,...
initializationOptions);
```

Set the agent options.

```
DQNagent.AgentOptions.DiscountFactor = 1;
DQNagent.AgentOptions.EpsilonGreedyExploration.EpsilonDecay = 0.001;
```
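To get a feel for what a decay of `0.001` means, the following back-of-the-envelope sketch estimates how long exploration lasts. It assumes that epsilon is multiplied by `1 - EpsilonDecay` after every training step until it reaches a minimum value, and that the initial and minimum values are `1` and `0.01`; these two values are assumptions here, so check the agent options in your release. All variable names in the block are illustrative.

```matlab
% Assumed exploration settings. initialEpsilon and minEpsilon are assumed
% defaults and are not set anywhere in this example.
initialEpsilon = 1;
minEpsilon = 0.01;
epsilonDecay = 0.001;       % value set in this example
stepsPerEpisode = 10;       % one step per time period (finalTimePeriod)

% Assumed rule: epsilon <- epsilon*(1 - epsilonDecay) after each step
numSteps = ceil(log(minEpsilon/initialEpsilon)/log(1 - epsilonDecay));
numEpisodes = ceil(numSteps/stepsPerEpisode);

fprintf('Epsilon reaches its minimum after about %d steps (%d episodes)\n', ...
    numSteps,numEpisodes)
```

Under these assumptions, exploration fades out well before the 3000-episode training limit used later, which leaves most of the training for refining the greedy policy.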

### Train Agent

Once you define the agent and the environment, you can use the `train` (Reinforcement Learning Toolbox) function to train the DQN agent. To configure the options for training, use `rlTrainingOptions` (Reinforcement Learning Toolbox).

trainingOptions = rlTrainingOptions;

Set the maximum number of episodes to train to `3000`. If you use a constant return function, training takes longer, so reduce the number of episodes.

trainingOptions.MaxEpisodes = 3000;

Set the window length for averaging the rewards to `100`. `ScoreAveragingWindowLength` is the number of episodes included in the average.

trainingOptions.ScoreAveragingWindowLength = 100;

Train the agent. The Reinforcement Learning Episode Manager opens. When the `sparseReward` function (see Local Functions) is selected as the reward from the environment, the training time is approximately five minutes. If you set `do_train` to false, a network pretrained using the `sparseReward` function is loaded.

```
do_train = false;
if do_train
    % Train the agent
    trainingStats = train(DQNagent,gbwmEnvironment,trainingOptions);
else
    load('trainedAgent')
end
```

### Simulate Using Agent

Simulate 1000 scenarios taking the actions of the trained agent `DQNagent`. Use `rlSimulationOptions` (Reinforcement Learning Toolbox) to set the number of simulations.

```
numSimulations = 1e3;
simulationOptions = rlSimulationOptions(NumSimulations=numSimulations);
```

Simulate the trained agent using `sim` (Reinforcement Learning Toolbox).

experience = sim(gbwmEnvironment,DQNagent,simulationOptions);

Obtain the wealth observations per period and the rewards at the end of the episode. These rewards show whether the target wealth is achieved by the end of the investment horizon or not.

```
% Retrieve simulation information
cumulativeReward = 0;
wealthSimulation = zeros(finalTimePeriod+1,numSimulations);
for i = 1:numSimulations
    cumulativeReward = cumulativeReward+...
        experience(i).Reward.Data(end);
    wealthSimulation(:,i) =...
        squeeze(experience(i).Observation.obs1.Data(1,1,:));
end
```

Calculate an empirical approximation of the success rate of the agent's policy.

```
% Compute the testing success probability
successProb = cumulativeReward/numSimulations
```

successProb = 0.7890

For this problem, the GBWM optimal probability, 78.90%, is similar to the one found in Dynamic Portfolio Allocation in Goal-Based Wealth Management for Multiple Time Periods (Financial Toolbox), where the optimal probability is 79.28%. Since both examples use a Brownian motion to simulate the wealth evolution, the solution from the dynamic programming approach is the closest to the theoretical optimum. Thus, the optimal probability of 78.90% achieved with the RL approach is close to the theoretical optimum as well.
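Because the RL figure is an average over 1000 simulated episodes, it carries sampling error. As a rough check of how close the two probabilities are, the sketch below computes the standard error of a sample proportion and a normal-approximation 95% interval. The variable names `dpProb`, `stdError`, and `confInterval` are introduced here, and the interval is only an approximation.

```matlab
successProb = 0.7890;       % RL estimate reported above
numSimulations = 1e3;
dpProb = 0.7928;            % dynamic programming value quoted above

% Standard error of a sample proportion and an approximate 95% interval
stdError = sqrt(successProb*(1 - successProb)/numSimulations);
confInterval = successProb + 1.96*stdError*[-1 1];

fprintf('Standard error: %.4f\n',stdError)
fprintf('95%% interval: [%.4f, %.4f]\n',confInterval)
```

The standard error is about 0.013, so the 0.38 percentage point gap between the two approaches is well inside the simulation noise.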

```
% Plot the asset allocation for one simulation
simNumber = 332;
figure
tiledlayout(2,1)

% Tile 1: Wealth progress
nexttile
plot(0:finalTimePeriod,wealthSimulation(:,simNumber))
hold on
yline(targetWealth,'--k',LineWidth=2)
title('Wealth Evolution')
xlabel('Time Period')
ylabel('Wealth Level')
hold off

% Tile 2: Optimal action
nexttile
bar(0:finalTimePeriod-1,...
    pwgt(:,experience(simNumber).Action.act1.Data)','stacked')
title('Portfolio Evolution')
xlabel('Time Period')
ylabel('Portfolio Weights')
legend('Low risk asset','Medium risk asset','High risk asset')
```

When you use the sparse reward function, for simulation number 332, from time *t* = `0` to *t* = `2`, the wealth evolution is favorable. This means that the initial portfolio is adequate and there is no incentive to change the portfolio allocation. However, at time the wealth decreases considerably. This signals the agent that the portfolio allocation should become more aggressive in order to make up for the loss and ensure that the investment goal is achieved by the end of the investment horizon. Since the wealth level stays below 160 until , the aggressiveness of the portfolio does not decrease. The optimal strategy of the agent is to be aggressive when the wealth is far from the goal and the time period is close to the end of the investment horizon. The agent's entire optimal strategy is illustrated in Heatmap of Optimal Policy.

Using the previous code, when you change the `simNumber` to `803` and rerun the example, you obtain the following results. (Note that, as with simulation 332, these results come from a model trained with the sparse reward function.)

Because of the losses in $\mathit{t}=1$, the agent chooses an aggressive allocation early on. In $\mathit{t}=3$, the wealth level increases considerably and that makes the agent choose a less aggressive portfolio. However, because the wealth drops again, the agent returns to its aggressive strategy. Finally, even though the wealth level at $\mathit{t}=7$ is similar to the wealth in $\mathit{t}=3$, the agent still selects an aggressive strategy because the end of the investment horizon is close. These results show how the agent incorporates not only the wealth level but also the time period information into its decisions.

### Heatmap of Optimal Policy

First, get the minimum and maximum terminal wealth levels obtained across the simulations.

```
% Minimum and maximum wealth levels
WMin = min(wealthSimulation(end,:));
WMax = max(wealthSimulation(end,:));
```

Compute a set of possible wealth levels using `25` logarithmically spaced points between the maximum and minimum. Then, create a grid of possible observations using these levels and the time steps.

```
% Logarithmic wealth levels
numWealthPoints = 25;
wealthLevel = logspace(log10(WMin),log10(WMax),numWealthPoints);

% Grid of possible observations
wVar = repmat(wealthLevel',finalTimePeriod,1);
tVar = repelem((0:finalTimePeriod-1)',numWealthPoints);
obs = zeros(2,1,finalTimePeriod*numWealthPoints);
obs(:,1,:) = [wVar';tVar'];
```
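The ordering produced by `repmat` and `repelem` is easier to see on a tiny grid. The following sketch uses three assumed wealth levels and two time periods, with names such as `demoWealth` and `obsDemo` introduced only for illustration, and packs them into the same 2-by-1-by-N layout.

```matlab
% Small illustrative grid: 3 wealth levels and 2 time periods (assumed values)
demoWealth = [50 100 200];
demoPeriods = 2;

wDemo = repmat(demoWealth',demoPeriods,1);              % wealth cycles fastest
tDemo = repelem((0:demoPeriods-1)',numel(demoWealth));  % time changes slowest

% Pack into a 2-by-1-by-N array, one observation per page
obsDemo = zeros(2,1,demoPeriods*numel(demoWealth));
obsDemo(:,1,:) = [wDemo';tDemo'];

disp(squeeze(obsDemo)')     % one [wealth timePeriod] pair per row
```

Each page of the array is one `[wealth; timePeriod]` observation, and all wealth levels for period 0 come before those for period 1.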

Compute the action generated by the trained agent for each of the observations in the grid.

```
% Action for each observation
actions = getAction(DQNagent,obs);
actions = squeeze(actions{1});
```

Use `heatmap` to visualize the action taken by the trained agent for each of the observations.

```
% Create heatmap
figure
T = table(wVar,tVar,actions,...
    VariableNames={'WealthLevel','TimePeriod','Action'});
heatmap(T,'TimePeriod','WealthLevel',ColorVariable='Action',...
    ColorMethod='none')
```

This figure shows the actions chosen by the agent for different wealth levels and time periods. The numbers inside the grid represent the action taken by the agent at each state. In this example, when you use the sparse reward function, the trained agent invests only in portfolios #11, #13, or #15, where the higher the portfolio number, the more aggressive the portfolio. When the wealth level is high, the agent chooses the least risky of these portfolios, and the wealth cutoff increases as time progresses. If the wealth level is slightly below the wealth that would be achieved under the constant return assumption (see Constant Return Reward Function), the agent's optimal action is to choose the most aggressive portfolio. This choice increases the expected return, even at the cost of increasing the volatility. The extreme choice made by the agent is the result of the objective function not taking into account possible losses. If the investor also wants to take their losses into account, they can add those losses into the objective and reward function to penalize cases where the losses become too high.
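One possible way to penalize large losses, shown only as a hypothetical sketch, is to subtract a penalty from the sparse reward when the terminal wealth falls below a floor. The names `lossAwareReward`, `lossFloor`, and `lossPenalty` and their values are assumptions made for illustration; they are not part of this example or of Reinforcement Learning Toolbox.

```matlab
% Hypothetical loss-aware variant of the sparse reward. lossFloor and
% lossPenalty are illustrative parameters that do not appear in the example.
targetWealth = 200; finalTimePeriod = 10;
lossFloor = 60;             % assumed wealth level regarded as an excessive loss
lossPenalty = 0.5;          % assumed penalty size

% Reward of 1 for reaching the goal at the horizon, minus a penalty when
% the terminal wealth ends below the floor; zero before the horizon
lossAwareReward = @(W,t) ...
    double(t >= finalTimePeriod && W >= targetWealth) ...
    - lossPenalty*double(t >= finalTimePeriod && W < lossFloor);

disp(lossAwareReward(250,10))   % goal reached
disp(lossAwareReward(120,10))   % goal missed, loss tolerable
disp(lossAwareReward(40,10))    % goal missed, loss excessive
disp(lossAwareReward(40,5))     % before the horizon, no reward signal
```

With a reward like this, the expected return is no longer the success probability alone, so the trained policy would trade some probability of reaching the goal for protection against deep losses.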

### References

[1] Das, S. R., and S. Varma. "Dynamic Goals-Based Wealth Management Using Reinforcement Learning." *Journal of Investment Management* 18, no. 2 (2020): 1-20.

### Local Functions

```
function [initialObservation,loggedSignals] =...
    myResetFunction(initialWealth)
% Reset function to set wealth and time period to initial state.

% Store initial state in loggedSignals
loggedSignals.Wealth = initialWealth;
loggedSignals.TimePeriod = 0;

% Return initial observation
initialObservation = [initialWealth; loggedSignals.TimePeriod];

end

function [nextObservation,reward,isDone,loggedSignals] =...
    myStepFunction(action,loggedSignals,prsk,pret,finalTimePeriod,...
    rewardFunction)
% Step function to compute the wealth obtained from investing in
% the portfolio specified by the action. The function also
% computes the reward obtained by the current action and checks if
% episode is done.

% Get current wealth and time period
W = loggedSignals.Wealth;
t = loggedSignals.TimePeriod;

% Get risk and return levels for current action
mu = pret(action,1);
sigma = prsk(action,1);

% Compute next wealth
Z = randn;
W = W*exp(mu-(sigma^2/2)+sigma*Z);

% Store next state in loggedSignals
loggedSignals.Wealth = W;
loggedSignals.TimePeriod = t+1;

% Return observation
nextObservation = [loggedSignals.Wealth;loggedSignals.TimePeriod];

% Compute reward
reward = rewardFunction(loggedSignals);

% Check if episode has ended
if loggedSignals.TimePeriod >= finalTimePeriod
    isDone = true;
else
    isDone = false;
end

end

function reward = sparseReward(loggedSignals,finalTimePeriod,...
    targetWealth)
% Function that computes the reward obtained in the current state
% following the sparse rule:
%   R(W,t) = 1   if t >= T and W >= targetWealth
%            0   if t < T or W < targetWealth

if loggedSignals.TimePeriod >= finalTimePeriod &&...
        loggedSignals.Wealth >= targetWealth
    reward = 1;
else
    reward = 0;
end

end

function reward = constantReturnReward(loggedSignals,...
    initialWealth,finalTimePeriod,targetWealth,constantReturn)
% Function that computes the reward obtained in the current state
% following the constant return rule:
%   R(W,t) = 1     if t >= T and W >= targetWealth
%            0.1   if t < T and W >= W0 * (1+r)^t
%            0     o.w.

% Get current state
t = loggedSignals.TimePeriod;
W = loggedSignals.Wealth;

% Compute reward
if t >= finalTimePeriod && W >= targetWealth
    reward = 1;
elseif W >= initialWealth*(1+constantReturn)^t
    reward = 0.1;
else
    reward = 0;
end

end
```

## See Also

`Portfolio` (Financial Toolbox) | `estimateFrontier` (Financial Toolbox)

## Related Topics

- Single Period Goal-Based Wealth Management (Financial Toolbox)
- Dynamic Portfolio Allocation in Goal-Based Wealth Management for Multiple Time Periods (Financial Toolbox)