Policy Learning and Evaluation with Randomized Quasi-Monte Carlo
- URL: http://arxiv.org/abs/2202.07808v1
- Date: Wed, 16 Feb 2022 00:42:12 GMT
- Title: Policy Learning and Evaluation with Randomized Quasi-Monte Carlo
- Authors: Sebastien M. R. Arnold, Pierre L'Ecuyer, Liyu Chen, Yi-fan Chen, Fei Sha
- Abstract summary: We propose to replace Monte Carlo samples with low-discrepancy point sets.
We combine policy gradient methods with Randomized Quasi-Monte Carlo, yielding variance-reduced formulations of policy gradient and actor-critic algorithms.
Our empirical analyses validate the intuition that replacing Monte Carlo with Quasi-Monte Carlo yields significantly more accurate gradient estimates.
- Score: 28.835015520341766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning constantly deals with hard integrals, for example when
computing expectations in policy evaluation and policy iteration. These
integrals are rarely analytically solvable and typically estimated with the
Monte Carlo method, which induces high variance in policy values and gradients.
In this work, we propose to replace Monte Carlo samples with low-discrepancy
point sets. We combine policy gradient methods with Randomized Quasi-Monte
Carlo, yielding variance-reduced formulations of policy gradient and
actor-critic algorithms. These formulations are effective for policy evaluation
and policy improvement, as they outperform state-of-the-art algorithms on
standardized continuous control benchmarks. Our empirical analyses validate the
intuition that replacing Monte Carlo with Quasi-Monte Carlo yields
significantly more accurate gradient estimates.
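To make the abstract's idea concrete, here is a minimal sketch (illustrative only, not the authors' code) that estimates a one-dimensional score-function policy gradient with i.i.d. Monte Carlo draws versus a scrambled Sobol' (RQMC) point set; the return stand-in f and all constants are toy assumptions.

```python
# Minimal sketch: variance of a score-function gradient estimate for a
# Gaussian policy N(mu, 1), under plain Monte Carlo vs. scrambled Sobol'
# (randomized quasi-Monte Carlo) sampling. f is a toy stand-in for a return.
import numpy as np
from scipy.stats import norm, qmc

mu, n, reps = 0.5, 256, 100
f = lambda a: np.sin(a)

def grad_estimate(u):
    u = np.clip(u, 1e-12, 1 - 1e-12)        # guard the inverse CDF
    a = norm.ppf(u, loc=mu, scale=1.0)      # map uniform points to actions
    return np.mean(f(a) * (a - mu))         # score of N(mu, 1) is (a - mu)

rng = np.random.default_rng(0)
mc = [grad_estimate(rng.random(n)) for _ in range(reps)]
rqmc = [grad_estimate(qmc.Sobol(d=1, scramble=True, seed=s).random(n).ravel())
        for s in range(reps)]
print(f"MC   variance: {np.var(mc):.2e}")
print(f"RQMC variance: {np.var(rqmc):.2e}")
```

Scrambling keeps each point marginally uniform, so the estimator stays unbiased while inheriting the low-discrepancy structure that drives the variance down.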
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for dealing with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Actor-Critic learning for mean-field control in continuous time [0.0]
We study policy gradient for mean-field control in continuous time in a reinforcement learning setting.
By considering randomised policies with entropy regularisation, we derive a gradient expectation representation of the value function.
In the linear-quadratic mean-field framework, we obtain an exact parametrisation of the actor and critic functions defined on the Wasserstein space.
arXiv Detail & Related papers (2023-03-13T10:49:25Z)
- On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation [11.345796608258434]
We show that a first-visit version of such a policy iteration scheme converges to the optimal policy provided that the policy improvement step uses lookahead.
We also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.
arXiv Detail & Related papers (2023-01-23T20:32:41Z)
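For background, the first-visit Monte Carlo evaluation step this paper analyzes is the textbook estimator sketched below (a generic illustration under standard assumptions, not the paper's exact scheme; the lookahead-based improvement step is omitted).

```python
# Textbook first-visit Monte Carlo policy evaluation: each state's value is
# the running average of the discounted returns observed from its first
# visit in each episode. `episode` is a list of (state, reward) pairs.
from collections import defaultdict

def first_visit_mc_update(episode, V, counts, gamma=0.99):
    first_visit = {}                           # state -> index of first visit
    for t, (s, _) in enumerate(episode):
        first_visit.setdefault(s, t)
    G = 0.0
    for t in range(len(episode) - 1, -1, -1):  # walk backwards, accumulate G
        s, r = episode[t]
        G = r + gamma * G
        if first_visit[s] == t:                # update only on first visits
            counts[s] += 1
            V[s] += (G - V[s]) / counts[s]     # incremental average
    return V

V, counts = defaultdict(float), defaultdict(int)
V = first_visit_mc_update([("s0", 0.0), ("s1", 1.0)], V, counts)
```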
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims to use observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Fast Offline Policy Optimization for Large Scale Recommendation [74.78213147859236]
We derive an approximation of these policy learning algorithms that scales logarithmically with the catalogue size.
Our contribution is based upon combining three novel ideas.
Our estimator is an order of magnitude faster than naive approaches yet produces equally good policies.
arXiv Detail & Related papers (2022-08-08T11:54:11Z)
- On the Convergence of Reinforcement Learning with Monte Carlo Exploring Starts [5.137144629366217]
A basic simulation-based reinforcement learning algorithm is the Monte Carlo Exploring Starts (MCES) method.
We investigate the convergence of this algorithm for the case with undiscounted costs, also known as the shortest path problem.
As a side result, we also provide a proof of a version of the supermartingale convergence theorem commonly used in stochastic approximation.
arXiv Detail & Related papers (2020-07-21T16:19:09Z)
- Variational Policy Gradient Method for Reinforcement Learning with General Utilities [38.54243339632217]
In recent years, reinforcement learning systems with general goals beyond a cumulative sum of rewards have gained traction.
In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure.
We derive a new Variational Policy Gradient Theorem for RL with general utilities.
arXiv Detail & Related papers (2020-07-04T17:51:53Z)
- Deep Bayesian Quadrature Policy Optimization [100.81242753620597]
Deep Bayesian quadrature policy gradient (DBQPG) is a high-dimensional generalization of Bayesian quadrature for policy gradient estimation.
We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks.
arXiv Detail & Related papers (2020-06-28T15:44:47Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
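To see the kernelization idea in its simplest (single-step, one-dimensional) form, consider the hedged sketch below; the function names are hypothetical, and the paper's doubly robust estimators additionally incorporate a fitted regression model.

```python
# Sketch: off-policy value estimate for a deterministic target policy pi.
# The exact density ratio is undefined (pi puts all mass on one action),
# so a kernel around pi(s) serves as a smoothed surrogate.
import numpy as np

def kernel_ope_value(states, actions, rewards, behavior_density, pi, h=0.1):
    u = (actions - pi(states)) / h                       # distance to pi(s)
    k = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)   # Gaussian kernel K_h
    w = k / behavior_density(states, actions)            # smoothed weight
    return np.mean(w * rewards)
```

Shrinking the bandwidth h reduces the bias from crediting actions far from pi(s), at the price of higher variance.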
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
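As a rough illustration of value tracking with a Kalman filter, the sketch below applies a generic linear-Gaussian update to the weights of a linear value function; KOVA's actual objective and noise models differ, so this should be read as an assumption-laden analogy rather than the paper's algorithm.

```python
# Treat the value-function weights as a slowly drifting latent state and
# correct them with a Kalman update each time a noisy return is observed.
import numpy as np

class KalmanValueTracker:
    def __init__(self, dim, obs_noise=1.0, drift=1e-4):
        self.theta = np.zeros(dim)   # weights of a linear value function
        self.P = np.eye(dim)         # parameter uncertainty (covariance)
        self.r, self.q = obs_noise, drift

    def update(self, phi, G):
        """phi: state features, G: observed noisy return."""
        self.P += self.q * np.eye(len(phi))        # allow parameter drift
        s = phi @ self.P @ phi + self.r            # innovation variance
        k = self.P @ phi / s                       # Kalman gain
        self.theta += k * (G - phi @ self.theta)   # correct toward the return
        self.P -= np.outer(k, phi) @ self.P        # shrink uncertainty
        return phi @ self.theta                    # current value estimate
```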