Control randomisation approach for policy gradient and application to reinforcement learning in optimal switching
- URL: http://arxiv.org/abs/2404.17939v2
- Date: Tue, 30 Apr 2024 14:32:51 GMT
- Title: Control randomisation approach for policy gradient and application to reinforcement learning in optimal switching
- Authors: Robert Denkert, Huyên Pham, Xavier Warin,
- Abstract summary: We propose a comprehensive framework for policy gradient methods tailored to continuous time reinforcement learning.
This is based on the connection between control problems and randomised problems, enabling applications across various classes of Markovian continuous time control problems.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a comprehensive framework for policy gradient methods tailored to continuous time reinforcement learning. This is based on the connection between stochastic control problems and randomised problems, enabling applications across various classes of Markovian continuous time control problems, beyond diffusion models, including e.g. regular, impulse and optimal stopping/switching problems. By utilizing change of measure in the control randomisation technique, we derive a new policy gradient representation for these randomised problems, featuring parametrised intensity policies. We further develop actor-critic algorithms specifically designed to address general Markovian stochastic control issues. Our framework is demonstrated through its application to optimal switching problems, with two numerical case studies in the energy sector focusing on real options.
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-ite convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergence (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Control in Stochastic Environment with Delays: A Model-based
Reinforcement Learning Approach [3.130722489512822]
We introduce a new reinforcement learning method for control problems in environments with delayed feedback.
Specifically, our method employs planning, versus previous methods that used deterministic planning.
We show that this formulation can recover the optimal policy for problems with deterministic transitions.
arXiv Detail & Related papers (2024-02-01T03:53:56Z) - Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods [0.40964539027092917]
Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems.
In practice all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming.
This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, where the parameters are trained backwards in time.
arXiv Detail & Related papers (2023-10-04T09:21:01Z) - Towards a Theoretical Foundation of Policy Optimization for Learning
Control Policies [26.04704565406123]
Gradient-based methods have been widely used for system design and optimization in diverse application domains.
Recently, there has been a renewed interest in studying theoretical properties of these methods in the context of control and reinforcement learning.
This article surveys some of the recent developments on policy optimization, a gradient-based iterative approach for feedback control synthesis.
arXiv Detail & Related papers (2022-10-10T16:13:34Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new algorithm for a policy gradient in TMDPs by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement
Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications.
We propose a learning algorithm that decouples the action constraints from the policy parameter update.
We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z) - Improper Learning with Gradient-based Policy Optimization [62.50997487685586]
We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
arXiv Detail & Related papers (2021-02-16T14:53:55Z) - Non-stationary Online Learning with Memory and Non-stochastic Control [71.14503310914799]
We study the problem of Online Convex Optimization (OCO) with memory, which allows loss functions to depend on past decisions.
In this paper, we introduce dynamic policy regret as the performance measure to design algorithms robust to non-stationary environments.
We propose a novel algorithm for OCO with memory that provably enjoys an optimal dynamic policy regret in terms of time horizon, non-stationarity measure, and memory length.
arXiv Detail & Related papers (2021-02-07T09:45:15Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.