Sigmoidally Preconditioned Off-policy Learning: a new exploration method for reinforcement learning
- URL: http://arxiv.org/abs/2205.10047v1
- Date: Fri, 20 May 2022 09:38:04 GMT
- Title: Sigmoidally Preconditioned Off-policy Learning: a new exploration method for reinforcement learning
- Authors: Xing Chen, Dongcui Diao, Hechang Chen, Hengshuai Yao, Jielong Yang,
Haiyin Piao, Zhixiao Sun, Bei Jiang, Yi Chang
- Abstract summary: We focus on an off-policy Actor-Critic architecture and propose a novel method called Preconditioned Proximal Policy Optimization (P3O).
P3O can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective.
Results show that our P3O maximizes the CPI objective better than PPO during the training process.
- Score: 14.991913317341417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the major difficulties of reinforcement learning is learning from {\em
off-policy} samples, which are collected by a different policy (behavior
policy) from what the algorithm evaluates (the target policy). Off-policy
learning needs to correct the distribution of the samples from the behavior
policy towards that of the target policy. Unfortunately, importance sampling has
an inherent high-variance issue, which leads to poor gradient estimation in
policy gradient methods. We focus on an off-policy Actor-Critic architecture,
and propose a novel method, called Preconditioned Proximal Policy Optimization
(P3O), which can control the high variance of importance sampling by applying a
preconditioner to the Conservative Policy Iteration (CPI) objective. {\em This
preconditioning uses the sigmoid function in a special way such that, when there
is no policy change, the gradient is maximal, and hence the policy gradient
drives a large parameter update for efficient exploration of the parameter
space}. This is a novel exploration method that has not been studied before,
given that existing exploration methods are based on the novelty of states and
actions. We compare against several best-performing algorithms on both discrete
and continuous tasks, and the results confirm that {\em P3O is more off-policy
than PPO}
according to the "off-policyness" measured by the DEON metric, and P3O explores
in a larger policy space than PPO. Results also show that our P3O maximizes the
CPI objective better than PPO during the training process.
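As an illustration of the preconditioning idea, here is a minimal NumPy sketch. The exact P3O surrogate is defined in the paper; the form sigma(r_t - 1) * A_t below is an assumption chosen only to exhibit the stated property that the gradient of the sigmoid preconditioner is maximal when the ratio r_t = pi_theta / pi_behavior equals 1, i.e., when there is no policy change.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    # Derivative of the sigmoid; global maximum 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def cpi_surrogate(ratio, advantage):
    # Conservative Policy Iteration objective: E[r_t * A_t].
    return ratio * advantage

def preconditioned_surrogate(ratio, advantage):
    # Hypothetical sigmoid-preconditioned objective: E[sigma(r_t - 1) * A_t].
    # Not the paper's exact formula; it only illustrates the mechanism.
    return sigmoid(ratio - 1.0) * advantage

ratios = np.linspace(0.2, 3.0, 15)
grads = dsigmoid(ratios - 1.0)  # sensitivity of the preconditioner to the ratio
print("preconditioner gradient peaks at r =", ratios[np.argmax(grads)])  # r = 1.0
```

Because sigma'(0) = 0.25 is the global maximum of the sigmoid's derivative, the surrogate is most sensitive exactly when the target and behavior policies coincide, which is what drives the large parameter updates described above.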
Related papers
- Statistical Analysis of Policy Space Compression Problem [54.1754937830779]
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems.
Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process.
This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness.
arXiv Detail & Related papers (2024-11-15T02:46:55Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
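The variance the paper targets is easy to demonstrate: the spread of an importance-sampled estimate depends strongly on which behavior policy collected the samples. A minimal sketch (a generic IS estimator on a Gaussian toy problem, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)

def is_estimator_variance(mu_behavior, n=2000, trials=200):
    """Variance of the IS estimate of E_{a ~ N(0,1)}[a^2] when samples are
    drawn from a behavior policy N(mu_behavior, 1)."""
    estimates = []
    for _ in range(trials):
        a = rng.normal(mu_behavior, 1.0, size=n)
        # log importance weight: log N(a; 0, 1) - log N(a; mu_behavior, 1)
        log_w = -0.5 * a**2 + 0.5 * (a - mu_behavior) ** 2
        estimates.append(np.mean(np.exp(log_w) * a**2))
    return np.var(estimates)

for mu in [0.0, 0.5, 1.0, 2.0]:
    print(f"behavior mean {mu}: estimator variance {is_estimator_variance(mu):.4f}")
```

The variance grows rapidly as the behavior policy drifts away from the target, which motivates actively choosing the behavior policy rather than treating IS as a passive re-weighting step.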
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Actor-Critic Reinforcement Learning with Phased Actor [10.577516871906816]
We propose a novel phased actor in actor-critic (PAAC) method to improve policy gradient estimation.
PAAC accounts for both $Q$ value and TD error in its actor update.
Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed and success rate.
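As a hedged illustration of the two signals mentioned, the sketch below shows a Q-value-weighted and a TD-error-weighted actor update direction; the phasing rule that combines them is the paper's contribution and is not reproduced here.

```python
import numpy as np

def actor_grad_q(score, q_value):
    # Q-weighted score-function direction: grad log pi(a|s) * Q(s, a).
    return score * q_value

def actor_grad_td(score, reward, v_s, v_next, gamma=0.99):
    # TD-error-weighted direction: grad log pi(a|s) * delta,
    # with delta = r + gamma * V(s') - V(s).
    td_error = reward + gamma * v_next - v_s
    return score * td_error

score = np.array([0.3, -0.1])  # example grad log pi(a|s)
print(actor_grad_q(score, q_value=1.5))
print(actor_grad_td(score, reward=1.0, v_s=0.8, v_next=1.0))
```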
arXiv Detail & Related papers (2024-04-18T01:27:31Z)
- Gradient Informed Proximal Policy Optimization [35.22712034665224]
We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm.
By adaptively modifying the alpha value, we can effectively manage the influence of analytical policy gradients during learning.
Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments.
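A minimal sketch of the blending described, assuming an alpha-weighted convex combination (the combination form and the adaptation heuristic below are assumptions, not the paper's exact rules):

```python
import numpy as np

def blended_gradient(analytic_grad, ppo_grad, alpha):
    # Mix an analytical environment gradient with a PPO likelihood-ratio
    # gradient; alpha controls the influence of the analytical term.
    return alpha * analytic_grad + (1.0 - alpha) * ppo_grad

def adapt_alpha(alpha, analytic_var, ppo_var, lr=0.1):
    # Hypothetical adaptation heuristic: shift weight toward the
    # lower-variance gradient estimate.
    target = ppo_var / (analytic_var + ppo_var + 1e-8)
    return alpha + lr * (target - alpha)

g = blended_gradient(np.array([0.2, 0.4]), np.array([0.1, -0.3]), alpha=0.5)
print(g, adapt_alpha(0.5, analytic_var=0.01, ppo_var=0.04))
```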
arXiv Detail & Related papers (2023-12-14T07:50:21Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
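Assuming COPG simply drops the pointwise min from PPO's surrogate and optimizes the clipped term alone (an assumption consistent with the title, not a confirmed reading of the paper), the two objectives compare as follows:

```python
import numpy as np

def ppo_objective(ratio, adv, eps=0.2):
    # PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(ratio * adv, clipped)

def copg_objective(ratio, adv, eps=0.2):
    # Clipped-objective policy gradient: the clipped term alone.
    return np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv

ratios = np.linspace(0.5, 1.5, 11)
adv = -1.0  # the surrogates differ for negative advantages above 1 + eps
print(ppo_objective(ratios, adv) - copg_objective(ratios, adv))
```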
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form.
arXiv Detail & Related papers (2020-10-27T13:40:06Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
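The failure mode is that a deterministic target policy pi puts a point mass on pi(s), so the ratio pi(a|s)/mu(a|s) is degenerate. A standard kernelization, sketched below, smooths the target policy with a bandwidth-h kernel; this shows the smoothing idea only, not the paper's doubly robust estimators.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kernelized_is_weights(actions, target_actions, behavior_density, h=0.1):
    # Smoothed weights K_h(a - pi(s)) / mu(a|s) for a deterministic target
    # policy; h trades bias (large h) against variance (small h).
    u = (actions - target_actions) / h
    return gaussian_kernel(u) / (h * behavior_density)

a = np.array([0.45, 1.30, 0.02])    # logged continuous actions
pi_s = np.array([0.50, 0.50, 0.00])  # deterministic target actions pi(s)
mu = np.array([0.8, 0.6, 1.1])       # behavior density mu(a|s)
print(kernelized_is_weights(a, pi_s, mu))
```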
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
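A minimal instance of this setting is the LSTD-style plug-in below (a generic illustration of off-policy evaluation with linear function approximation, not the paper's minimax analysis): solve Phi^T (Phi - gamma * Phi') theta = Phi^T r, where Phi stacks features of logged (s, a) pairs and Phi' features of (s', pi(s')) under the target policy.

```python
import numpy as np

def lstd_ope(phi, phi_next_target, rewards, gamma=0.99, reg=1e-6):
    """Estimate linear Q^pi weights from logged transitions.
    phi: (N, d) features of (s, a); phi_next_target: (N, d) features of
    (s', pi(s')); rewards: (N,) observed rewards."""
    A = phi.T @ (phi - gamma * phi_next_target)
    b = phi.T @ rewards
    return np.linalg.solve(A + reg * np.eye(phi.shape[1]), b)
```

The estimated cumulative value of the target policy is then the feature vector of (s_0, pi(s_0)) dotted with the returned weights, averaged over initial states.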
arXiv Detail & Related papers (2020-02-21T19:20:57Z)