Optimal scheduling of entropy regulariser for continuous-time
linear-quadratic reinforcement learning
- URL: http://arxiv.org/abs/2208.04466v3
- Date: Fri, 15 Sep 2023 00:54:48 GMT
- Title: Optimal scheduling of entropy regulariser for continuous-time
linear-quadratic reinforcement learning
- Authors: Lukasz Szpruch, Tanut Treetanthiploet, Yufei Zhang
- Abstract summary: Herein the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy.
This exploration-exploitation trade-off is determined by the strength of entropy regularisation.
We prove that the regret, for both learning algorithms, is of the order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
- Score: 9.779769486156631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work uses the entropy-regularised relaxed stochastic control perspective
as a principled framework for designing reinforcement learning (RL) algorithms.
Herein the agent interacts with the environment by generating noisy controls
distributed according to the optimal relaxed policy. The noisy policies, on the
one hand, explore the space and hence facilitate learning but, on the other
hand, introduce bias by assigning a positive probability to non-optimal
actions. This exploration-exploitation trade-off is determined by the strength
of entropy regularisation. We study algorithms resulting from two entropy
regularisation formulations: the exploratory control approach, where entropy is
added to the cost objective, and the proximal policy update approach, where
entropy penalises policy divergence between consecutive episodes. We focus on
the finite horizon continuous-time linear-quadratic (LQ) RL problem, where a
linear dynamics with unknown drift coefficients is controlled subject to
quadratic costs. In this setting, both algorithms yield a Gaussian relaxed
policy. We quantify the precise difference between the value functions of a
Gaussian policy and its noisy evaluation and show that the execution noise must
be independent across time. By tuning the frequency of sampling from relaxed
policies and the parameter governing the strength of entropy regularisation, we
prove that the regret, for both learning algorithms, is of the order
$\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes,
matching the best known result from the literature.
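To make the episodic scheme concrete, the sketch below (not the authors' code) simulates a one-dimensional instance of the setting: linear dynamics $dX_t = (A X_t + B a_t)\,dt + dW_t$ with unknown drift $(A, B)$, quadratic running and terminal costs, and a Gaussian relaxed policy whose mean is the greedy LQ feedback and whose variance is set by the entropy strength $\tau$ (taken here as $\tau/(2R)$). The execution noise is drawn independently at every time step, as the abstract requires, and $\tau$ is decayed across episodes. The dimension, cost weights, the $\tau_n \sim n^{-1/2}$ schedule, and the least-squares drift estimator are illustrative assumptions rather than the paper's exact algorithm or constants.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown to the learner) drift coefficients and known cost weights (all assumed values).
A_true, B_true = 0.5, 1.0
Q, R, G, T, dt = 1.0, 1.0, 1.0, 1.0, 0.01
n_steps = int(T / dt)

def riccati_gains(A, B):
    # Backward Euler pass of the Riccati ODE dP/dt = -(2AP - B^2 P^2 / R + Q), P_T = G;
    # returns the LQ feedback gains K_k = B * P_k / R on the time grid.
    P = G
    gains = np.empty(n_steps)
    for k in reversed(range(n_steps)):
        gains[k] = B * P / R
        P += dt * (2.0 * A * P - B * B * P * P / R + Q)
    return gains

A_hat, B_hat = 0.0, 0.5            # initial drift guesses (assumed initialisation)
S = np.zeros((2, 2))               # least-squares statistics for the drift:
b = np.zeros(2)                    # theta_hat = (sum phi phi^T dt)^{-1} sum phi dX, phi = (x, a)

N = 200                            # number of episodes
for n in range(1, N + 1):
    tau = 1.0 / np.sqrt(n)         # entropy-strength schedule (assumed ~ n^{-1/2})
    gains = riccati_gains(A_hat, B_hat)
    x = 1.0
    for k in range(n_steps):
        # Gaussian relaxed policy: mean is the greedy LQ feedback, variance tau/(2R);
        # the execution noise is drawn afresh at every step, i.e. independent across time.
        a = -gains[k] * x + np.sqrt(tau / (2.0 * R)) * rng.standard_normal()
        dW = np.sqrt(dt) * rng.standard_normal()
        x_next = x + (A_true * x + B_true * a) * dt + dW
        phi = np.array([x, a])
        S += np.outer(phi, phi) * dt
        b += phi * (x_next - x)
        x = x_next
    # Refit the drift estimate from all data gathered so far (small ridge term for stability).
    A_hat, B_hat = np.linalg.solve(S + 1e-6 * np.eye(2), b)

print(f"estimated drift: A = {A_hat:.3f}, B = {B_hat:.3f} (true values {A_true}, {B_true})")
```

A decaying schedule of this kind shrinks the per-episode exploration cost at a rate that, in the paper's analysis, balances estimation error against exploration bias to yield the $\mathcal{O}(\sqrt{N})$ regret; the specific constants and estimator above are only for illustration.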
Related papers
- Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z)
- Exploratory Optimal Stopping: A Singular Control Formulation [2.7309692684728613]
We explore continuous-time and state-space optimal stopping problems from a reinforcement learning perspective.
We introduce a regularized version of the problem by penalizing it with the cumulative residual entropy of the randomized stopping time.
For the specific case of a real option problem, we derive a semi-explicit solution to the regularized problem.
arXiv Detail & Related papers (2024-08-18T02:31:55Z)
- Entropy annealing for policy mirror descent in continuous time and space [2.8255028200738455]
We analyze a continuous-time policy mirror descent dynamics, which updates the policy based on the gradient of an entropy-regularized value function.
We prove that with a fixed entropy level, the dynamics converges exponentially to the optimal solution of the regularized problem.
arXiv Detail & Related papers (2024-05-30T17:02:18Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control [0.6906005491572401]
We propose a novel $K$-nearest neighbor reparametric procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
arXiv Detail & Related papers (2023-06-07T23:55:12Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
- Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization [20.651913793555163]
We revisit the classical entropy regularized policy gradient methods with the soft-max policy parametrization.
We establish a global optimality convergence result and a sample complexity of $\widetilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ for the proposed algorithm.
arXiv Detail & Related papers (2021-10-19T17:21:09Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)