Train DQN Agent with LSTM Network to Control House Heating System

This example shows how to train a deep Q-learning network (DQN) agent with a Long Short-Term Memory (LSTM) network to control a house heating system modeled in Simscape™. For more information on DQN agents, see Deep Q-Network (DQN) Agent.

Fix Random Number Seed to Improve Reproducibility

The example code involves the computation of random numbers at various stages, such as agent initialization, critic creation, resetting the environment during simulations, generating observations (for stochastic environments), generating exploration actions, and sampling mini-batches of experiences for learning. Fixing the random number stream preserves the sequence of random numbers every time you run the code and therefore improves the reproducibility of results. You will fix the random number stream at various locations in the example.

Fix the random number stream with seed 0 using the Mersenne Twister algorithm. For more information on random number generation, see rng.

previousRngState = rng(0, "twister");

The output previousRngState is a structure that contains information about the previous state of the stream. You will restore the state at the end of the example.

House Heating Model

The reinforcement learning (RL) environment for this example uses a model from the House Heating System (Simscape) example. The model in this example contains a heater, a thermostat controlled by an RL agent, a house, outside temperatures, and a reward function. Heat is transferred between the outside environment and the interior of the home through the walls, windows, and roof. Weather station data collected between March 21 and April 15, 2022, from the MathWorks® campus in Natick, MA, is used to simulate the outside temperature. ThingSpeak™ was used to obtain this data. The data file, temperatureMar21toApr15_2022.mat, is located in the example folder. For more information about the data acquisition, see Compare Temperature Data from Three Different Days (ThingSpeak).

The training goal for the agent is to minimize the energy cost and maximize the comfort of the room by turning the heater on and off. The house is comfortable when the room temperature $T_{room}$ is between $T_{comfortMin}$ and $T_{comfortMax}$.

The reinforcement learning problem is defined as follows:

  • The observation is a 6-dimensional column vector that consists of the room temperature (C), outside temperature (C), maximum comfort temperature (C), minimum comfort temperature (C), last action, and price per kWh (USD). The maximum and minimum comfort temperatures and price per kWh in this example do not change over time, and you do not need to use them to train the agent. However, you can extend this example by varying these values over time.

  • The action is discrete, and represents either turning the heater off (action is 0) or turning it on (action is 1).

  • The total reward $R$ is the sum of the comfort reward $R_c$ and the switching penalty $P_s$, minus the energy cost $C_e$: $R = R_c + P_s - C_e$. These three components are defined as follows (a MATLAB sketch of this computation appears after the list).

$$
R_c =
\begin{cases}
0.1 & \text{if } T_{comfortMin} \le T_{room} \le T_{comfortMax} \\
-w\,|T_{room}-T_{comfortMin}| & \text{if } T_{room} < T_{comfortMin} \\
-w\,|T_{room}-T_{comfortMax}| & \text{if } T_{room} > T_{comfortMax}
\end{cases}
$$

where $w = 0.1$, $T_{comfortMin} = 18$, and $T_{comfortMax} = 23$.

$$
P_s =
\begin{cases}
-0.01 & \text{if } a_t \ne a_{t-1} \\
0 & \text{otherwise}
\end{cases}
$$

where $a_t$ is the current action and $a_{t-1}$ is the previous action.

$$
C_e = \text{CostPerStep} = \text{PricePerKwh} \times \text{ElectricityUsed}
$$

The reward function is inspired by [1].

  • The is-done signal is always 0, indicating that there is no early termination condition.
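
The following MATLAB sketch illustrates the reward computation described above. It is for illustration only: in the model, the reward is computed inside a Simulink subsystem, and the function name localComputeReward and its input arguments are hypothetical.

function r = localComputeReward(Troom,action,lastAction,pricePerKwh,electricityUsed)
% localComputeReward  Illustrative sketch of the reward described above.
% All names and inputs are hypothetical; the model computes this reward
% inside a Simulink subsystem.
w = 0.1; TcomfortMin = 18; TcomfortMax = 23;

% Comfort reward Rc
if Troom >= TcomfortMin && Troom <= TcomfortMax
    Rc = 0.1;
elseif Troom < TcomfortMin
    Rc = -w*abs(Troom - TcomfortMin);
else
    Rc = -w*abs(Troom - TcomfortMax);
end

% Switching penalty Ps
if action ~= lastAction
    Ps = -0.01;
else
    Ps = 0;
end

% Energy cost Ce
Ce = pricePerKwh*electricityUsed;

% Total reward
r = Rc + Ps - Ce;
end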

Define Simulink Environment

Open the model.

mdl = "rlHouseHeatingSystem";
open_system(mdl)

Define a sample time and the maximum number of steps per episode. Both variables are needed in the script and in the reset function.

sampleTime = 120; % seconds
maxStepsPerEpisode = 1000;

Assign the agent block path information to the agentBlk variable, for later use.

agentBlk = mdl + "/Smart Thermostat/RL Agent";

Load the outside temperature data to simulate the environment temperature.

data = load("temperatureMar21toApr15_2022.mat");
temperatureData = data.temperatureData;

Extract validation data.

temperatureMarch21 = temperatureData(1:60*24,:);         % first day (March 21)
temperatureApril15 = temperatureData(end-60*24+1:end,:);  % last day (April 15)

Extract training data.

temperatureData = temperatureData(60*24+1:end-60*24,:); % remaining days (March 22 to April 14)

The Simulink® model uses the following workspace variables as part of the observation.

outsideTemperature = temperatureData;
comfortMax = 23;
comfortMin = 18;

Define observation and action specifications.

obsInfo = rlNumericSpec([6,1]);
actInfo = rlFiniteSetSpec([0,1]); % (0=off,1=on)

Create an environment interface for the house heating environment.

env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);

Use hRLHeatingSystemResetFcn to reset the environment at the beginning of each episode. This function randomly selects a time between March 22 and April 14, which the environment uses as the initial time for the outside temperature data. A sketch of what such a reset function might look like follows the next command.

env.ResetFcn = @(in) hRLHeatingSystemResetFcn(in);
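
The helper function hRLHeatingSystemResetFcn is provided with the example. The minimal sketch below only suggests what such a reset function might do; the variable name temperatureStartIndex and the offset range are assumptions for illustration and do not reproduce the shipped helper.

function in = localResetFcnSketch(in)
% localResetFcnSketch  Minimal sketch of an environment reset function
% (assumption; the example ships hRLHeatingSystemResetFcn).
% Pick a random start day within the training data (March 22 to April 14)
% and pass it to the model as a workspace variable.
samplesPerDay = 60*24;                           % one day of one-minute samples
startDay = randi([0 23]);                        % hypothetical day offset
startIndex = startDay*samplesPerDay + 1;         % hypothetical start index
in = setVariable(in,"temperatureStartIndex",startIndex); % hypothetical variable name
end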

Create DQN Agent with LSTM Network

DQN agents use a parametrized Q-value function approximator to estimate the value of the policy. Because DQN agents have a discrete action space, you have the option to create a vector (that is, multi-output) Q-value function critic, which is generally more efficient than a comparable single-output critic.

A vector Q-value function takes only the observation as input and returns as output a single vector with as many elements as the number of possible actions. The value of each output element represents the expected discounted cumulative long-term reward when an agent starts from the state corresponding to the given observation and executes the action corresponding to the element number (and follows a given policy afterwards).

Fix the random number stream.

rng(0, "twister");

Specify training options for the critic using rlOptimizerOptions.

criticOpts = rlOptimizerOptions( ...
    LearnRate=0.001, ...
    GradientThreshold=1);

Specify the DQN agent options using rlDQNAgentOptions, and include the training options for the critic.

agentOpts = rlDQNAgentOptions(...
    UseDoubleDQN = false, ...    
    TargetSmoothFactor = 1, ...
    TargetUpdateFrequency = 4, ...   
    ExperienceBufferLength = 1e6, ...
    CriticOptimizerOptions = criticOpts, ...
    MiniBatchSize = 64);

You can also set or modify the agent options using dot notation.

agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.0001;
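
As a rough illustration, the following sketch plots how the exploration probability evolves under these settings. It assumes the multiplicative update Epsilon = Epsilon*(1 - EpsilonDecay) applied at each agent step and floored at EpsilonMin, as described in the rlDQNAgentOptions documentation; the number of steps is arbitrary.

% Sketch: visualize the epsilon decay schedule (assumes the multiplicative
% update Epsilon = Epsilon*(1 - EpsilonDecay), floored at EpsilonMin).
epsilon    = agentOpts.EpsilonGreedyExploration.Epsilon;
epsilonMin = agentOpts.EpsilonGreedyExploration.EpsilonMin;
decay      = agentOpts.EpsilonGreedyExploration.EpsilonDecay;
numSteps   = 50000;                     % arbitrary horizon for the plot
epsilonHistory = zeros(1,numSteps);
for k = 1:numSteps
    epsilonHistory(k) = epsilon;
    epsilon = max(epsilonMin, epsilon*(1 - decay));
end
figure
plot(epsilonHistory)
xlabel("Agent step")
ylabel("Epsilon")
title("Exploration schedule (sketch)")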

To model the parametrized Q-value function within the critic, use a recurrent neural network, which can capture the effect of previous observations. By setting the UseRNN option in rlAgentInitializationOptions, you can create a default DQN agent with an LSTM network. Alternatively, you can configure the LSTM network manually. For an example that creates an LSTM network for a DQN agent manually, see Water Distribution System Scheduling Using Reinforcement Learning.

useRNN = true;
initOpts = rlAgentInitializationOptions( ...
    UseRNN=useRNN, ...
    NumHiddenUnit=64);

When using a recurrent neural network, you must set SequenceLength to a value greater than 1 in rlDQNAgentOptions. This option determines the length of the experience sequences in each mini-batch used to compute the gradient during training. For more information about LSTM layers, see Long Short-Term Memory Neural Networks.

if useRNN
    agentOpts.SequenceLength = 20;
end

Create a DQN agent with a default critic network using rlDQNAgent, the agent initialization options object (which specifies that a recurrent network is needed), and the agent options object.

agent = rlDQNAgent(obsInfo, actInfo, initOpts, agentOpts);

You can also modify the agent options, including the critic options, using dot notation. For example, specify the agent sample time.

agent.SampleTime = sampleTime;
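
Optionally, you can extract the critic from the agent and evaluate it to confirm that it returns one Q-value per action (off and on). The random observation in the sketch below is for illustration only.

% Optional check (sketch): the vector Q-value critic returns one value
% per possible action for the given observation.
critic = getCritic(agent);
sampleObs = {rand(6,1)};   % arbitrary observation, for illustration only
qValues = getValue(critic,sampleObs)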

Train Agent

To train the agent, first specify the training options. For this example, use the following options.

  • Run training for at most 500 episodes, with each episode lasting at most 1000 (maxStepsPerEpisode) time steps.

  • Set the Plots option to "training-progress", which displays training progress in the Reinforcement Learning Training Monitor.

  • Set the Verbose option to false to disable the command line display.

  • Stop training when the agent receives a mean evaluation episode reward greater than 85.

For more information, see rlTrainingOptions.

trainOpts = rlTrainingOptions(...
    MaxEpisodes = 500, ...
    MaxStepsPerEpisode = maxStepsPerEpisode, ...
    ScoreAveragingWindowLength = 5,...    
    Verbose = false, ...
    Plots = "training-progress",...
    StopTrainingCriteria = "EvaluationStatistic",...
    StopTrainingValue = 85);

Fix the random number stream.

rng(0, "twister");

Train the agent using the train function. Training this agent is a computationally intensive process that takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;
if doTraining
    % Evaluate the agent every 25 training episodes.
    evl = rlEvaluator( ...
        EvaluationFrequency=25, ...
        NumEpisodes=3, ...
        RandomSeeds=1:3);
    % Train the agent.
    trainingStats = train(agent,env,trainOpts,Evaluator=evl);
else
    % Load the pretrained agent for the example.
    load("HeatControlDQNAgent.mat","agent")
end

Simulate DQN Agent

To validate the performance of the trained agent, simulate it within the house heating system. For more information on agent simulation, see rlSimulationOptions and sim.

Fix the random number stream.

rng(0, "twister");

First, evaluate the agent's performance using the temperature data from March 21st, 2022. The agent did not use this temperature data for training.

maxSteps = 720; % 720 steps of 120 seconds each correspond to one day
validationTemperature = temperatureMarch21;
env.ResetFcn = @(in) hRLHeatingSystemValidateResetFcn(in);
simOptions = rlSimulationOptions(MaxSteps = maxSteps);
experience1 = sim(env,agent,simOptions);

Use the localPlotResults function, provided at the end of the example, to analyze the performance.

localPlotResults(experience1, maxSteps, ...
    comfortMax, comfortMin, sampleTime,1)

The resulting figure shows the room, outside, and comfort temperatures, the total energy cost, and the energy cost per step over time (in minutes) for the March 21 data.

Comfort Temperature violation: 0/1440 minutes, cost: 7.717194 dollars

Next, evaluate the agent's performance using the temperature data from April 15th, 2022. The agent did not use this temperature data during training either.

% Validate agent using the data from April 15
validationTemperature = temperatureApril15;
experience2 = sim(env,agent,simOptions);
localPlotResults( ...
    experience2, ...
    maxSteps, ...
    comfortMax, ...
    comfortMin, ...
    sampleTime,2)

The resulting figure shows the same temperature, total cost, and cost-per-step plots for the April 15 data.

Comfort Temperature violation: 0/1440 minutes, cost: 8.259467 dollars

Evaluate the agent's performance when the temperature is mild. Add eight degrees to the temperature from April 15th to create data for mild temperatures.

% Validate agent using the data from April 15 + 8 degrees
validationTemperature = temperatureApril15;
validationTemperature(:,2) = validationTemperature(:,2) + 8;
experience3 = sim(env,agent,simOptions);
localPlotResults(experience3, ...
    maxSteps, ...
    comfortMax, ...
    comfortMin, ...
    sampleTime, ...
    3)

The resulting figure shows the same temperature, total cost, and cost-per-step plots for the shifted (mild) temperature data.

Comfort Temperature violation: 0/1440 minutes, cost: 0.771909 dollars

Restore the random number stream using the information stored in previousRngState.

rng(previousRngState)

Local Function

function localPlotResults( ...
    experience, maxSteps, comfortMax, comfortMin, sampleTime, figNum)
    % localPlotResults plots results of validation

    % Compute the comfort temperature violation in minutes
    % (each simulation step corresponds to sampleTime/60 minutes)
    obsData = experience.Observation.obs1.Data(1,:,1:maxSteps);
    minutesViolateComfort = ...
        (sum(obsData < comfortMin) + sum(obsData > comfortMax)) ...
        * sampleTime/60;
    
    % Cost of energy
    totalCosts = ...
        experience.SimulationInfo(1).househeat_output{1}.Values;
    totalCosts.Time = totalCosts.Time/60;
    totalCosts.TimeInfo.Units="minutes";
    totalCosts.Name = "Total Energy Cost";
    finalCost = totalCosts.Data(end);

    % Cost of energy per step
    costPerStep = ...
        experience.SimulationInfo(1).househeat_output{2}.Values;
    costPerStep.Time = costPerStep.Time/60;
    costPerStep.TimeInfo.Units="minutes";    
    costPerStep.Name = "Energy Cost per Step";
    minutes = (0:maxSteps)*sampleTime/60;

    % Plot results   

    fig = figure(figNum);
    % Change the size of the figure
    fig.Position = fig.Position + [0, 0, 0, 200];

    % Temperatures
    layoutResult = tiledlayout(3,1);
    nexttile
    plot(minutes, ...
        reshape(experience.Observation.obs1.Data(1,:,:), ...
        [1,length(experience.Observation.obs1.Data)]),"k")
    hold on
    plot(minutes, ...
        reshape(experience.Observation.obs1.Data(2,:,:), ...
        [1,length(experience.Observation.obs1.Data)]),"g")
    yline(comfortMin,'b')
    yline(comfortMax,'r')
    lgd = legend("T_{room}", "T_{outside}","T_{comfortMin}", ...
        "T_{comfortMax}","location","northoutside");
    lgd.NumColumns = 4;
    title("Temperatures")
    ylabel("Temperature")
    xlabel("Time (minutes)")
    hold off

    % Total cost
    nexttile
    plot(totalCosts)    
    title("Total Cost")
    ylabel("Energy cost")

    % Cost per step
    nexttile
    plot(costPerStep)  
    title("Cost per step")
    ylabel("Energy cost")    
    fprintf("Comfort Temperature violation:" + ...
        " %d/1440 minutes, cost: %f dollars\n", ...
        minutesViolateComfort, finalCost);
end

References

[1] Du, Yan, Fangxing Li, Kuldeep Kurte, Jeffrey Munk, and Helia Zandi. “Demonstration of Intelligent HVAC Load Management With Deep Reinforcement Learning: Real-World Experience of Machine Learning in Demand Control.” IEEE Power and Energy Magazine 20, no. 3 (May 2022): 42–53. https://doi.org/10.1109/MPE.2022.3150825.
