Optimization problem using Q-learning
I have a problem with an optimization using Q-learning.
The idea is to charge the ESS during the cheapest hours and discharge it during the most expensive hours to reduce the total electricity cost.
The reward for each step is how much the ESS reduced the electricity cost, and there is a constraint that the SOC of the ESS cannot drop below a minimum (and a time-varying critical threshold).
The action is how much the ESS charges or discharges, and the SOC state should reflect that action: if the ESS discharges, the SOC should decrease, and if it charges, the SOC should increase.
Everything else runs, but the state does not seem to reflect the action properly. Can anyone help with this problem?
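For reference, this is how a single action should move the SOC with my parameters (assuming each time step is one hour):
% Maximum charge action for one step (action = +0.5, P_ess_max = 1500 kW, ESS_cap = 3000 kWh, eff = 0.95)
dSOC_charge    = ( 0.5*1500 / 3000) * 0.95;   % =  0.2375  -> SOC rises by about 24 percentage points
% Maximum discharge action for one step
dSOC_discharge = (-0.5*1500 / 3000) / 0.95;   % = -0.2632  -> SOC falls by about 26 percentage points
The full script is below.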
clc; clear;
%% ===== Environment parameters =====
T = 48;
eff_cha = 0.95;
eff_dch = 0.95;
SOC_min = 0.1;
SOC_max = 1;
SOC0 = 0.5;
P_ess_max = 1500; % ESS maximum charge/discharge power (kW)
ESS_cap = 3000; % ESS capacity (kWh)
actions = linspace(-0.5,0.5,41); % action as a fraction of P_ess_max, in [-0.5, 0.5]
numActions = length(actions);
%% ===== Learning parameters =====
alpha = 0.1;
gamma = 0.99; % not used directly here, since the update is Monte Carlo-style
epsilon = 0.5;
epsilon_min = 0.05;
epsilon_decay = 0.995;
numEpisodes = 60000;
%% ===== State space (discretized) =====
numSOCs = 101;
numPrices = 3;
Q = zeros(numSOCs, numPrices, T, numActions);
%% ===== Price / load data =====
price_real = 140.5*ones(1,24);
price_real(1:7) = 87.3;
price_real(22:24) = 87.3;
price_real(8:10) = 109.8;
price_real(12) = 109.8;
price_real(18:21) = 109.8;
price_real = [price_real, price_real]; % 48 hours
load_real = table2array(readtable('48_consumption_6.1.xlsx'));
pv_real = table2array(readtable("PV_gen.xlsx"));
load_real = load_real - pv_real; % net load (kW)
%% ===== SOC critical threshold =====
p_crt_val = 0.03;
for p = 1:24
    p_crt(p) = p_crt_val*(24-p);
end
p_crt = [p_crt, p_crt] + 0.03*randn(1,48);
%% ===== Normalization for state discretization =====
price_norm = price_real / max(price_real);
% Map a value in [0,1] to an integer bin index in 1..numPrices
discretizeState = @(x) min(max(floor(x * numPrices) + 1, 1), numPrices);
%% ===== Monte Carlo learning loop =====
saving_history = NaN(1,numEpisodes); % record savings only for completed episodes
completion_rate = zeros(1,numEpisodes); % completion rate per episode
for ep = 1:numEpisodes
    SOC = SOC0;              % initial SOC ratio
    episode_memory = [];     % [SOC_idx, price_idx, time, action_idx]
    grid_before_ep = zeros(1,T);
    grid_after_ep = zeros(1,T);
    done_flag = true;        % completion flag
    for t = 1:T
        % State
        s_idx = [discretizeState(SOC), discretizeState(price_norm(t)), t];
        % epsilon-greedy action
        if rand < epsilon
            a_idx = randi(numActions);
        else
            [~, a_idx] = max(Q(s_idx(1), s_idx(2), s_idx(3), :));
        end
        a_kW = actions(a_idx) * P_ess_max;
        % SOC update
        if a_kW >= 0
            SOC_next = SOC + (a_kW / ESS_cap) * eff_cha;
        else
            SOC_next = SOC + (a_kW / ESS_cap) / eff_dch;
        end
        % Clamp when a hard constraint is violated (action forced to 0)
        if SOC_next > SOC_max
            SOC_next = SOC_max; a_kW = 0;
        elseif SOC_next < SOC_min
            SOC_next = SOC_min; a_kW = 0;
        elseif SOC_next < p_crt(t)
            SOC_next = p_crt(t); a_kW = 0;
        end
        % Record grid power
        grid_before_ep(t) = load_real(t);
        grid_after_ep(t) = load_real(t) + a_kW;
        % Record state/action
        episode_memory(end+1,:) = [s_idx, a_idx];
        SOC = SOC_next;
    end
    % Update Q and record results only for completed episodes
    if done_flag && length(episode_memory) == T
        cost_before_ep = sum(grid_before_ep .* price_real);
        cost_after_ep = sum(grid_after_ep .* price_real);
        saving_ep = cost_before_ep - cost_after_ep;
        saving_history(ep) = saving_ep;
        for step = 1:size(episode_memory,1)
            s_idx = episode_memory(step,1:3);
            a_idx = episode_memory(step,4);
            Q(s_idx(1), s_idx(2), s_idx(3), a_idx) = ...
                Q(s_idx(1), s_idx(2), s_idx(3), a_idx) + ...
                alpha * (saving_ep - Q(s_idx(1), s_idx(2), s_idx(3), a_idx));
        end
    end
    % Decay epsilon
    if epsilon > epsilon_min
        epsilon = epsilon * epsilon_decay;
    end
    % Record completion rate
    completion_rate(ep) = sum(~isnan(saving_history)) / ep;
    if mod(ep,10000) == 0
        fprintf("Episode %d: completed=%d, saving=%.2f KRW, epsilon=%.3f\n", ...
            ep, done_flag, saving_history(ep), epsilon);
    end
end
%% ===== Learning performance visualization =====
%% ===== Simulate the learned policy =====
SOC = SOC0;
SOC_traj = zeros(1,T);
act_traj = zeros(1,T);
grid_power_before = zeros(1,T);
grid_power_after = zeros(1,T);
for t = 1:T
    grid_power_before(t) = load_real(t);
    s_idx = [discretizeState(SOC), discretizeState(price_norm(t)), t];
    [~, a_idx] = max(Q(s_idx(1), s_idx(2), s_idx(3), :));
    a_kW = actions(a_idx) * P_ess_max;
    if a_kW >= 0
        SOC_next = SOC + (a_kW / ESS_cap) * eff_cha;
    else
        SOC_next = SOC + (a_kW / ESS_cap) / eff_dch;
    end
    if SOC_next > SOC_max
        SOC_next = SOC_max; a_kW = 0;
    elseif SOC_next < SOC_min
        SOC_next = SOC_min; a_kW = 0;
    elseif SOC_next < p_crt(t)
        SOC_next = p_crt(t); a_kW = 0;
    end
    grid_power_after(t) = load_real(t) + a_kW;
    SOC_traj(t) = SOC_next;
    act_traj(t) = a_kW;
    SOC = SOC_next;
end
%% ===== Final cost calculation =====
cost_before = sum(grid_power_before .* price_real);
cost_after = sum(grid_power_after .* price_real);
saving = cost_before - cost_after;
fprintf('Final electricity cost without the ESS: %.3f KRW\n', cost_before);
fprintf('Final electricity cost with the ESS: %.3f KRW\n', cost_after);
fprintf('Final saving: %.3f KRW (%.2f%% reduction)\n', saving, saving/cost_before*100);
%% ===== Simulation result visualization =====
figure;
plot(saving_history); title('Learning Curve'); xlabel('Episode'); ylabel('Total Reward'); yticks(-4e5:1e5:9e5); grid on;
figure;
plot(100*SOC_traj,'LineWidth',1); hold on; plot(100*p_crt, 'r','LineWidth',1); title('SOC Trajectory'); ylabel('SOC(%)');ylim([-5 105]);legend('SOC','Critical Load'); grid on;
figure;
stairs(act_traj, '-x'); title('Action Trajectory (kW)'); grid on;
figure;
stairs(price_real); title('Price'); xlabel('Time'); ylabel('Price');
1 Comment
Goutam
on 3 Sep 2025
Hi 찬목 박,
I think your Q-learning agent might be having trouble because it can't clearly link its actions to the outcomes they produce. With only three price levels, it may be too hard for the agent to tell the difference between low, medium, and high electricity costs, especially when those differences can be significant.
Also, since you're using Monte Carlo learning, the agent only gets feedback after the full 48-hour cycle is over. So it’s like getting a final score “you saved 600k won” without knowing which specific decisions (like charging at hour 5 or discharging at hour 20) actually made that happen. That kind of delayed and vague feedback makes it tough for the agent to learn which actions were truly valuable.
One idea that might help: try increasing 'numPrices' from 3 to 10. It’s a small tweak, but it could give the agent a much clearer picture of the price landscape and help it make more informed decisions without needing to redesign the whole system.
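To make that concrete, here is a rough sketch of both ideas together: a finer price grid, plus a per-step temporal-difference update so each action gets immediate feedback instead of only the end-of-episode saving. This is only an illustration, not a drop-in fix; the per-step reward r_t below is one possible choice, and a separate SOC discretizer is shown because the posted discretizeState maps the SOC index into only numPrices bins even though Q allocates numSOCs = 101 of them.
% Sketch only -- variable names follow the original script, r_t is an illustrative reward
numPrices = 10;                                   % finer price bins (was 3)
numSOCs   = 101;
Q = zeros(numSOCs, numPrices, T, numActions);
discretizePrice = @(x) min(max(floor(x * numPrices) + 1, 1), numPrices);
discretizeSOC   = @(x) min(max(floor(x * numSOCs)   + 1, 1), numSOCs);
% ... inside the hourly loop, with s_idx = [discretizeSOC(SOC), discretizePrice(price_norm(t)), t],
% once a_kW and SOC_next are known:
r_t = -a_kW * price_real(t);    % per-step saving: positive when discharging at a high price
if t < T
    s_next = [discretizeSOC(SOC_next), discretizePrice(price_norm(t+1)), t+1];
    target = r_t + gamma * max(Q(s_next(1), s_next(2), s_next(3), :));
else
    target = r_t;               % terminal step: nothing to bootstrap from
end
Q(s_idx(1), s_idx(2), s_idx(3), a_idx) = Q(s_idx(1), s_idx(2), s_idx(3), a_idx) + ...
    alpha * (target - Q(s_idx(1), s_idx(2), s_idx(3), a_idx));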