Sigmoidally Preconditioned Off-policy Learning: a new exploration method for reinforcement learning
- URL: http://arxiv.org/abs/2205.10047v1
- Date: Fri, 20 May 2022 09:38:04 GMT
- Title: Sigmoidally Preconditioned Off-policy Learning: a new exploration method for reinforcement learning
- Authors: Xing Chen, Dongcui Diao, Hechang Chen, Hengshuai Yao, Jielong Yang,
Haiyin Piao, Zhixiao Sun, Bei Jiang, Yi Chang
- Abstract summary: We focus on an off-policy Actor-Critic architecture and propose a novel method called Preconditioned Proximal Policy Optimization (P3O).
P3O can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective.
Results show that our P3O maximizes the CPI objective better than PPO during the training process.
- Score: 14.991913317341417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the major difficulties of reinforcement learning is learning from {\em
off-policy} samples, which are collected by a different policy (behavior
policy) from what the algorithm evaluates (the target policy). Off-policy
learning needs to correct the distribution of the samples from the behavior
policy towards that of the target policy. Unfortunately, importance sampling has
an inherent high-variance issue, which leads to poor gradient estimation in
policy gradient methods. We focus on an off-policy Actor-Critic architecture,
and propose a novel method, called Preconditioned Proximal Policy Optimization
(P3O), which can control the high variance of importance sampling by applying a
preconditioner to the Conservative Policy Iteration (CPI) objective. {\em This
preconditioning uses the sigmoid function in a special way such that, when there
is no policy change, the gradient is maximal, and hence the policy gradient
drives a large parameter update for efficient exploration of the parameter
space}. This is a novel exploration method that has not been studied before,
given that existing exploration methods are based on the novelty of states and
actions. We compare against several best-performing algorithms on both discrete
and continuous tasks, and the results confirm that {\em P3O is more off-policy
than PPO}
according to the "off-policyness" measured by the DEON metric, and P3O explores
in a larger policy space than PPO. Results also show that our P3O maximizes the
CPI objective better than PPO during the training process.
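As an illustration of the preconditioning idea, here is a minimal NumPy sketch. The exact P3O surrogate is defined in the paper; the form sigma(r_t - 1) * A_t below is an assumption chosen only to exhibit the stated property that the gradient of the sigmoid preconditioner is maximal when the ratio r_t = pi_theta / pi_behavior equals 1, i.e., when there is no policy change.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    # Derivative of the sigmoid; global maximum 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def cpi_surrogate(ratio, advantage):
    # Conservative Policy Iteration objective: E[r_t * A_t].
    return ratio * advantage

def preconditioned_surrogate(ratio, advantage):
    # Hypothetical sigmoid-preconditioned objective: E[sigma(r_t - 1) * A_t].
    # Not the paper's exact formula; it only illustrates the mechanism.
    return sigmoid(ratio - 1.0) * advantage

ratios = np.linspace(0.2, 3.0, 15)
grads = dsigmoid(ratios - 1.0)  # sensitivity of the preconditioner to the ratio
print("preconditioner gradient peaks at r =", ratios[np.argmax(grads)])  # r = 1.0
```

Because sigma'(0) = 0.25 is the global maximum of the sigmoid's derivative, the surrogate is most sensitive exactly when the target and behavior policies coincide, which is what drives the large parameter updates described above.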
Related papers
- Statistical Analysis of Policy Space Compression Problem [54.1754937830779]
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems.
Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process.
This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness.
arXiv Detail & Related papers (2024-11-15T02:46:55Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
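The variance the paper targets is easy to demonstrate: the spread of an importance-sampled estimate depends strongly on which behavior policy collected the samples. A minimal sketch (a generic IS estimator on a Gaussian toy problem, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)

def is_estimator_variance(mu_behavior, n=2000, trials=200):
    """Variance of the IS estimate of E_{a ~ N(0,1)}[a^2] when samples are
    drawn from a behavior policy N(mu_behavior, 1)."""
    estimates = []
    for _ in range(trials):
        a = rng.normal(mu_behavior, 1.0, size=n)
        # log importance weight: log N(a; 0, 1) - log N(a; mu_behavior, 1)
        log_w = -0.5 * a**2 + 0.5 * (a - mu_behavior) ** 2
        estimates.append(np.mean(np.exp(log_w) * a**2))
    return np.var(estimates)

for mu in [0.0, 0.5, 1.0, 2.0]:
    print(f"behavior mean {mu}: estimator variance {is_estimator_variance(mu):.4f}")
```

The variance grows rapidly as the behavior policy drifts away from the target, which motivates actively choosing the behavior policy rather than treating IS as a passive re-weighting step.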
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Actor-Critic Reinforcement Learning with Phased Actor [10.577516871906816]
We propose a novel phased actor in actor-critic (PAAC) method to improve policy gradient estimation.
PAAC accounts for both $Q$ value and TD error in its actor update.
Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed and success rate.
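As a hedged illustration of the two signals mentioned, the sketch below shows a Q-value-weighted and a TD-error-weighted actor update direction; the phasing rule that combines them is the paper's contribution and is not reproduced here.

```python
import numpy as np

def actor_grad_q(score, q_value):
    # Q-weighted score-function direction: grad log pi(a|s) * Q(s, a).
    return score * q_value

def actor_grad_td(score, reward, v_s, v_next, gamma=0.99):
    # TD-error-weighted direction: grad log pi(a|s) * delta,
    # with delta = r + gamma * V(s') - V(s).
    td_error = reward + gamma * v_next - v_s
    return score * td_error

score = np.array([0.3, -0.1])  # example grad log pi(a|s)
print(actor_grad_q(score, q_value=1.5))
print(actor_grad_td(score, reward=1.0, v_s=0.8, v_next=1.0))
```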
arXiv Detail & Related papers (2024-04-18T01:27:31Z)
- Gradient Informed Proximal Policy Optimization [35.22712034665224]
We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm.
By adaptively modifying the alpha value, we can effectively manage the influence of analytical policy gradients during learning.
Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments.
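A minimal sketch of the blending described, assuming an alpha-weighted convex combination (the combination form and the adaptation heuristic below are assumptions, not the paper's exact rules):

```python
import numpy as np

def blended_gradient(analytic_grad, ppo_grad, alpha):
    # Mix an analytical environment gradient with a PPO likelihood-ratio
    # gradient; alpha controls the influence of the analytical term.
    return alpha * analytic_grad + (1.0 - alpha) * ppo_grad

def adapt_alpha(alpha, analytic_var, ppo_var, lr=0.1):
    # Hypothetical adaptation heuristic: shift weight toward the
    # lower-variance gradient estimate.
    target = ppo_var / (analytic_var + ppo_var + 1e-8)
    return alpha + lr * (target - alpha)

g = blended_gradient(np.array([0.2, 0.4]), np.array([0.1, -0.3]), alpha=0.5)
print(g, adapt_alpha(0.5, analytic_var=0.01, ppo_var=0.04))
```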
arXiv Detail & Related papers (2023-12-14T07:50:21Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
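Assuming COPG simply drops the pointwise min from PPO's surrogate and optimizes the clipped term alone (an assumption consistent with the title, not a confirmed reading of the paper), the two objectives compare as follows:

```python
import numpy as np

def ppo_objective(ratio, adv, eps=0.2):
    # PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(ratio * adv, clipped)

def copg_objective(ratio, adv, eps=0.2):
    # Clipped-objective policy gradient: the clipped term alone.
    return np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv

ratios = np.linspace(0.5, 1.5, 11)
adv = -1.0  # the surrogates differ for negative advantages above 1 + eps
print(ppo_objective(ratios, adv) - copg_objective(ratios, adv))
```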
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form.
arXiv Detail & Related papers (2020-10-27T13:40:06Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
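The failure mode is that a deterministic target policy pi puts a point mass on pi(s), so the ratio pi(a|s)/mu(a|s) is degenerate. A standard kernelization, sketched below, smooths the target policy with a bandwidth-h kernel; this shows the smoothing idea only, not the paper's doubly robust estimators.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kernelized_is_weights(actions, target_actions, behavior_density, h=0.1):
    # Smoothed weights K_h(a - pi(s)) / mu(a|s) for a deterministic target
    # policy; h trades bias (large h) against variance (small h).
    u = (actions - target_actions) / h
    return gaussian_kernel(u) / (h * behavior_density)

a = np.array([0.45, 1.30, 0.02])    # logged continuous actions
pi_s = np.array([0.50, 0.50, 0.00])  # deterministic target actions pi(s)
mu = np.array([0.8, 0.6, 1.1])       # behavior density mu(a|s)
print(kernelized_is_weights(a, pi_s, mu))
```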
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
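A minimal instance of this setting is the LSTD-style plug-in below (a generic illustration of off-policy evaluation with linear function approximation, not the paper's minimax analysis): solve Phi^T (Phi - gamma * Phi') theta = Phi^T r, where Phi stacks features of logged (s, a) pairs and Phi' features of (s', pi(s')) under the target policy.

```python
import numpy as np

def lstd_ope(phi, phi_next_target, rewards, gamma=0.99, reg=1e-6):
    """Estimate linear Q^pi weights from logged transitions.
    phi: (N, d) features of (s, a); phi_next_target: (N, d) features of
    (s', pi(s')); rewards: (N,) observed rewards."""
    A = phi.T @ (phi - gamma * phi_next_target)
    b = phi.T @ rewards
    return np.linalg.solve(A + reg * np.eye(phi.shape[1]), b)
```

The estimated cumulative value of the target policy is then the feature vector of (s_0, pi(s_0)) dotted with the returned weights, averaged over initial states.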
arXiv Detail & Related papers (2020-02-21T19:20:57Z)