MPC-based Reinforcement Learning for Economic Problems with Application
to Battery Storage
- URL: http://arxiv.org/abs/2104.02411v1
- Date: Tue, 6 Apr 2021 10:37:14 GMT
- Title: MPC-based Reinforcement Learning for Economic Problems with Application
to Battery Storage
- Authors: Arash Bahari Kordabad, Wenqi Cai, Sebastien Gros
- Abstract summary: We focus on policy approximations based on Model Predictive Control (MPC)
We observe that the policy gradient method can struggle to produce meaningful steps in the policy parameters when the policy has a (nearly) bang-bang structure.
We propose a homotopy strategy based on the interior-point method, providing a relaxation of the policy during the learning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we are interested in optimal control problems with purely
economic costs, which often yield optimal policies having a (nearly) bang-bang
structure. We focus on policy approximations based on Model Predictive Control
(MPC) and the use of the deterministic policy gradient method to optimize the
MPC closed-loop performance in the presence of unmodelled stochasticity or
model error. When the policy has a (nearly) bang-bang structure, we observe
that the policy gradient method can struggle to produce meaningful steps in the
policy parameters. To tackle this issue, we propose a homotopy strategy based
on the interior-point method, providing a relaxation of the policy during the
learning. We investigate a specific well-known battery storage problem, and
show that the proposed method delivers more homogeneous and faster learning than a
classical policy gradient approach.
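To make the bang-bang issue and the interior-point relaxation concrete, a minimal sketch follows (not the authors' implementation; the one-step battery model, the parameter theta, and the input bounds are illustrative assumptions). In generic interior-point notation, the relaxation replaces the hard input constraints of the parametric MPC problem with a log-barrier whose weight tau is annealed toward zero during learning:

```latex
% Hedged sketch of the log-barrier (interior-point) relaxation of an MPC policy;
% generic notation, not necessarily the paper's exact formulation.
\pi_\theta(s) = \arg\min_{u}\; \ell_\theta(s,u) \quad \text{s.t.}\quad g(s,u) \le 0
\;\;\longrightarrow\;\;
\pi_\theta^{\tau}(s) = \arg\min_{u}\; \ell_\theta(s,u) - \tau \sum_i \log\!\bigl(-g_i(s,u)\bigr),
\qquad \tau \downarrow 0 .
```

The toy script below shows why this matters for the deterministic policy gradient: with a purely linear (economic) stage cost the unrelaxed one-step policy saturates at a bound, so its sensitivity to theta is zero almost everywhere, while the barrier-relaxed policy responds smoothly to theta.

```python
# Minimal sketch, not the authors' code: a one-step "MPC" policy with a linear
# economic cost whose unrelaxed minimizer is bang-bang, and its log-barrier
# (interior-point) relaxation.  theta, the price model and the bounds are
# illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar

U_MIN, U_MAX = -1.0, 1.0              # assumed charge/discharge limits

def policy(theta, price, tau):
    """u = argmin_u (price + theta) * u - tau * [log(u - U_MIN) + log(U_MAX - u)]."""
    a = price + theta                 # marginal economic cost of charging (assumed)
    if tau == 0.0:
        # No relaxation: a linear cost over a box is minimized at a vertex (bang-bang).
        return U_MIN if a > 0 else U_MAX
    def relaxed_cost(u):
        return a * u - tau * (np.log(u - U_MIN) + np.log(U_MAX - u))
    res = minimize_scalar(relaxed_cost,
                          bounds=(U_MIN + 1e-9, U_MAX - 1e-9),
                          method="bounded")
    return float(res.x)

def sensitivity(theta, price, tau, eps=1e-4):
    """Finite-difference du/dtheta, the sensitivity a deterministic policy gradient needs."""
    return (policy(theta + eps, price, tau) - policy(theta - eps, price, tau)) / (2.0 * eps)

if __name__ == "__main__":
    theta, price = 0.2, 0.5
    for tau in (0.0, 1e-2, 1e-1, 1.0):
        print(f"tau={tau:5.2f}  u={policy(theta, price, tau):+.4f}  "
              f"du/dtheta={sensitivity(theta, price, tau):+.4f}")
```

At tau = 0 the printed sensitivity is zero, so a policy gradient step barely moves the parameters; for tau > 0 it is finite, which is what the homotopy exploits: start learning with a sizeable tau and shrink it to recover the economic bang-bang policy.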
Related papers
- Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action [10.219627570276689]
We develop a framework for a class of Markov Decision Processes with general state and action spaces.
We show that gradient methods converge to the globally optimal policy at a non-asymptotic rate.
Our result establishes the first complexity result for multi-period inventory systems.
arXiv Detail & Related papers (2024-09-25T17:56:02Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline [47.16115174891401]
We propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate this variance issue.
We evaluate the proposed Off-OAB method on six representative tasks from OpenAI Gym and MuJoCo, where it demonstrably surpasses state-of-the-art methods on the majority of these tasks.
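The optimal action-dependent baseline of Off-OAB is specific to that paper; purely as a generic illustration of why subtracting a baseline reduces the variance of a policy-gradient estimate without changing its expectation, here is a toy REINFORCE-style sketch on a two-armed bandit (the bandit, its rewards and the simple value baseline are assumptions, not the paper's construction):

```python
# Generic variance-reduction sketch (not the Off-OAB estimator): softmax policy
# on a toy two-armed bandit, REINFORCE gradients with and without a baseline.
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, -0.2])              # softmax logits (policy parameters)
true_reward = np.array([1.0, 0.8])         # assumed mean reward of each arm

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(a):
    """Gradient of log pi(a) with respect to the softmax logits."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

def pg_samples(n, baseline):
    """Single-sample REINFORCE gradient estimates, optionally baseline-corrected."""
    p = softmax(theta)
    grads = np.empty((n, theta.size))
    for i in range(n):
        a = rng.choice(theta.size, p=p)
        r = true_reward[a] + rng.normal(0.0, 0.1)
        grads[i] = (r - baseline) * grad_log_pi(a)
    return grads

no_b   = pg_samples(20000, baseline=0.0)
with_b = pg_samples(20000, baseline=float(true_reward @ softmax(theta)))  # value baseline

print("mean without baseline:", no_b.mean(axis=0))    # same expectation ...
print("mean with    baseline:", with_b.mean(axis=0))
print("var  without baseline:", no_b.var(axis=0))     # ... but much larger variance
print("var  with    baseline:", with_b.var(axis=0))
```

Off-OAB pushes this idea further by letting the subtracted quantity depend on the action as well, constructed so that the estimator stays unbiased while its variance is minimized.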
arXiv Detail & Related papers (2024-05-04T05:21:28Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergent (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Policy Gradient for Rectangular Robust Markov Decision Processes [62.397882389472564]
We introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs).
Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent.
arXiv Detail & Related papers (2023-01-31T12:40:50Z) - Generalised Policy Improvement with Geometric Policy Composition [18.80807234471197]
We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL.
We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs.
We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors.
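As a minimal sketch of the switching behaviour described above (not the paper's GHM-based evaluation machinery; the class name, the cycling rule and the toy base policies are assumptions), a non-Markov policy hands control to another base Markov policy with a fixed per-step probability:

```python
# Minimal sketch of a non-Markov switching policy over base Markov policies;
# illustrative only, the paper evaluates such policies via geometric horizon models.
import numpy as np

class GeometricSwitchingPolicy:
    def __init__(self, base_policies, switch_prob, seed=0):
        self.base_policies = base_policies   # callables: state -> action
        self.switch_prob = switch_prob       # fixed per-step switching probability
        self.rng = np.random.default_rng(seed)
        self.current = 0

    def act(self, state):
        # Each base policy is followed for a geometrically distributed number of steps.
        if self.rng.random() < self.switch_prob:
            self.current = (self.current + 1) % len(self.base_policies)
        return self.base_policies[self.current](state)

policy = GeometricSwitchingPolicy([lambda s: -1.0, lambda s: +1.0], switch_prob=0.1)
print([policy.act(0.0) for _ in range(15)])
```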
arXiv Detail & Related papers (2022-06-17T12:52:13Z) - Variance Reduction based Partial Trajectory Reuse to Accelerate Policy
Gradient Optimization [3.621753051212441]
We extend the idea of green simulation assisted policy gradient (GS-PG) to partial historical trajectory reuse for Markov Decision Processes (MDPs).
In this paper, the mixture likelihood ratio (MLR) based policy gradient estimation is used to leverage the information from historical state decision transitions generated under different behavioral policies.
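A generic sketch of that mixture-likelihood-ratio reuse (not the GS-PG estimator itself; the Gaussian behaviour policies, their means and the reused quantity are assumptions): logged samples are reweighted by the ratio of the target-policy likelihood to an equal-weight mixture of the behaviour-policy likelihoods.

```python
# Generic mixture-likelihood-ratio (MLR) reuse sketch; all policies are 1-D Gaussians.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

behaviour_means = [0.0, 0.5, 1.0]   # assumed means of past behaviour policies
sigma = 0.3                          # shared exploration std (assumed)
target_mean = 0.8                    # current policy being evaluated

# Logged actions: equally many samples from each historical behaviour policy.
actions = np.concatenate([rng.normal(m, sigma, 200) for m in behaviour_means])

# MLR weight: target likelihood over the equal-weight mixture of behaviour likelihoods.
p_target = norm.pdf(actions, target_mean, sigma)
p_mix = np.mean([norm.pdf(actions, m, sigma) for m in behaviour_means], axis=0)
weights = p_target / p_mix

# Weighted reuse of the historical data to estimate a target-policy expectation.
estimate = np.mean(weights * actions**2)
print("MLR estimate of E_target[a^2]:", estimate)
print("ground truth                 :", target_mean**2 + sigma**2)
```

Because the logged data approximates the equal-weight mixture, the weighted average is a consistent estimate under the target policy, and the mixture in the denominator keeps the weights bounded compared with per-policy importance sampling.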
arXiv Detail & Related papers (2022-05-06T01:42:28Z) - Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(poly(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms prior heuristics inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z) - A Study of Policy Gradient on a Class of Exactly Solvable Models [35.90565839381652]
We explore the evolution of the policy parameters, for a special class of exactly solvable POMDPs, as a continuous-state Markov chain.
Our approach relies heavily on random walk theory, specifically on affine Weyl groups.
We analyze the probabilistic convergence of policy gradient to different local maxima of the value function.
arXiv Detail & Related papers (2020-11-03T17:27:53Z) - PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient
Learning [35.044047991893365]
This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which balances the exploration vs. exploitation tradeoff using an ensemble of policies (the policy cover).
We show that PC-PG has strong guarantees under model misspecification that go beyond the standard worst case $\ell_\infty$ assumptions.
We also complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
arXiv Detail & Related papers (2020-07-16T16:57:41Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.