When Do Off-Policy and On-Policy Policy Gradient Methods Align?
- URL: http://arxiv.org/abs/2402.12034v1
- Date: Mon, 19 Feb 2024 10:42:34 GMT
- Title: When Do Off-Policy and On-Policy Policy Gradient Methods Align?
- Authors: Davide Mambelli, Stephan Bongers, Onno Zoeter, Matthijs T.J. Spaan, Frans A. Oliehoek
- Abstract summary: Policy gradient methods are widely adopted reinforcement learning algorithms for tasks with continuous action spaces.
A common way to improve sample efficiency is to modify their objective function to be computable from off-policy samples without importance sampling.
This work studies the difference between the excursion objective and the traditional on-policy objective, which we refer to as the on-off gap.
- Score: 15.7221450531432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy gradient methods are widely adopted reinforcement learning algorithms
for tasks with continuous action spaces. These methods have succeeded in many
application domains; however, because of their notorious sample inefficiency,
their use remains limited to problems where fast and accurate simulations are
available. A common way to improve sample efficiency is to modify their
objective function to be computable from off-policy samples without importance
sampling. A well-established off-policy objective is the excursion objective.
This work studies the difference between the excursion objective and the
traditional on-policy objective, which we refer to as the on-off gap. We
provide the first theoretical analysis showing conditions to reduce the on-off
gap while establishing empirical evidence of shortfalls arising when these
conditions are not met.
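The abstract names the two objectives without defining them. The sketch below illustrates the idea under a common tabular formulation, which is an assumption here and not taken from the paper: the on-policy objective weights the target policy's state values by the start-state distribution, while the excursion objective weights them by the stationary state distribution of the behaviour policy. The toy two-state MDP, the function names, and the use of a plain difference as the "gap" are all hypothetical choices made for illustration.

```python
import numpy as np

def state_values(P, R, pi, gamma):
    """Solve V = r_pi + gamma * P_pi @ V for a tabular MDP.
    P: (S, A, S) transition probabilities, R: (S, A) rewards, pi: (S, A) policy."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

def stationary_distribution(P, mu):
    """Stationary state distribution of the chain induced by behaviour policy mu
    (assumes the induced chain is ergodic)."""
    P_mu = np.einsum("sa,sat->st", mu, P)
    evals, evecs = np.linalg.eig(P_mu.T)
    d = np.real(evecs[:, np.argmax(np.real(evals))])  # eigenvector for eigenvalue 1
    return d / d.sum()

def weighted_objective(weights, V):
    """State-value objective: sum_s weights(s) * V(s)."""
    return float(weights @ V)

# Hypothetical toy MDP: action 0 tends to stay, action 1 tends to switch state.
P = np.zeros((2, 2, 2))
P[0, 0] = [0.9, 0.1]
P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.1, 0.9]
P[1, 1] = [0.9, 0.1]
R = np.array([[0.0, 0.0], [1.0, 1.0]])    # reward 1 only while in state 1
gamma = 0.9

pi = np.array([[0.1, 0.9], [0.9, 0.1]])   # target policy: head for state 1, then stay
mu = np.array([[0.5, 0.5], [0.5, 0.5]])   # behaviour policy: uniform random
p0 = np.array([1.0, 0.0])                 # episodes start in state 0

V = state_values(P, R, pi, gamma)
d_mu = stationary_distribution(P, mu)

J_on = weighted_objective(p0, V)      # assumed on-policy objective
J_exc = weighted_objective(d_mu, V)   # assumed excursion objective
gap = J_exc - J_on                    # illustrative on-off gap for this toy problem
print(f"J_on={J_on:.3f}  J_exc={J_exc:.3f}  gap={gap:.3f}")
```

In this formulation the two objectives differ only in how states are weighted, so they coincide whenever the behaviour policy's state distribution matches the start-state distribution. The precise conditions analysed in the paper may differ from this toy observation.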
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergence (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Behind the Myth of Exploration in Policy Gradients [1.9171404264679484]
Policy-gradient algorithms are effective reinforcement learning methods for solving control problems with continuous state and action spaces.
To compute near-optimal policies, it is essential in practice to include exploration terms in the learning objective.
arXiv Detail & Related papers (2024-01-31T20:37:09Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach [0.0]
We introduce a novel policy similarity measure to mitigate the effects of such discrepancy in continuous control.
Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks.
arXiv Detail & Related papers (2022-08-01T11:33:12Z)
- Sigmoidally Preconditioned Off-policy Learning:a new exploration method for reinforcement learning [14.991913317341417]
We focus on an off-policy Actor-Critic architecture and propose a novel method called Preconditioned Proximal Policy Optimization (P3O).
P3O can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective.
Results show that our P3O maximizes the CPI objective better than PPO during the training process.
arXiv Detail & Related papers (2022-05-20T09:38:04Z)
- Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form.
arXiv Detail & Related papers (2020-10-27T13:40:06Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- A Nonparametric Off-Policy Policy Gradient [32.35604597324448]
Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes.
We build on the general sample efficiency of off-policy algorithms.
We show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2020-01-08T10:13:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.