Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction
- URL: http://arxiv.org/abs/2110.12081v1
- Date: Fri, 22 Oct 2021 22:07:51 GMT
- Title: Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction
- Authors: Jiachen Li, Shuo Cheng, Zhenyu Liao, Huayan Wang, William Yang Wang,
Qinxun Bai
- Abstract summary: We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
- Score: 73.77593805292194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Improving sample efficiency of reinforcement learning algorithms requires
effective exploration. Following the principle of $\textit{optimism in the face
of uncertainty}$, we train a separate exploration policy to maximize an
approximate upper confidence bound of the critics in an off-policy actor-critic
framework. However, this introduces extra differences between the replay buffer
and the target policy in terms of their stationary state-action distributions.
To mitigate the off-policy-ness, we adapt the recently introduced DICE
framework to learn a distribution correction ratio for off-policy actor-critic
training. In particular, we correct the training distribution for both policies
and critics. Empirically, we evaluate our proposed method in several
challenging continuous control tasks and show superior performance compared to
state-of-the-art methods. We also conduct extensive ablation studies to
demonstrate the effectiveness and the rationality of the proposed method.
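To make the two components described in the abstract concrete, here is a minimal, hypothetical PyTorch-style sketch. It assumes twin critics, a separate exploration policy, a target policy, and a learned DICE correction ratio w(s, a) approximating d_pi(s, a) / d_buffer(s, a); all names (ucb_estimate, beta_ucb, dice_ratio) are illustrative assumptions, not the authors' implementation.

```python
# Minimal illustrative sketch (not the authors' code). Assumes twin critics,
# an exploration policy, a target policy, and a learned DICE ratio
# w(s, a) ~= d_pi(s, a) / d_buffer(s, a). All names below are hypothetical.
import torch


def ucb_estimate(q1, q2, beta_ucb=1.0):
    """Approximate upper confidence bound from two critic estimates:
    their mean plus a bonus proportional to their disagreement."""
    mean = 0.5 * (q1 + q2)
    bonus = 0.5 * (q1 - q2).abs()
    return mean + beta_ucb * bonus


def exploration_policy_loss(critic1, critic2, explore_pi, states, beta_ucb=1.0):
    """The exploration policy maximizes the critics' UCB (optimism in the
    face of uncertainty), so we minimize its negative."""
    actions = explore_pi(states)
    q1, q2 = critic1(states, actions), critic2(states, actions)
    return -ucb_estimate(q1, q2, beta_ucb).mean()


def corrected_critic_loss(critic, target_q, states, actions, dice_ratio):
    """TD regression for a critic, re-weighted by the DICE correction ratio
    to account for the mismatch between the replay buffer and the target
    policy's stationary state-action distribution."""
    td_error = critic(states, actions) - target_q
    return (dice_ratio.detach() * td_error.pow(2)).mean()


def corrected_actor_loss(critic1, critic2, target_pi, states, dice_ratio):
    """Target-policy update on the pessimistic (minimum) critic value,
    re-weighted by the same correction ratio."""
    actions = target_pi(states)
    q_min = torch.min(critic1(states, actions), critic2(states, actions))
    return -(dice_ratio.detach() * q_min).mean()
```

In this reading, the correction ratio would be trained separately (e.g., with a DICE-style objective) and treated as a fixed weight inside the actor and critic losses; this is one plausible wiring, not a statement of the paper's exact algorithm.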
Related papers
- Preference-Guided Reinforcement Learning for Efficient Exploration [7.83845308102632]
We introduce LOPE: Learning Online with trajectory Preference guidancE, an end-to-end preference-guided RL framework.
Our intuition is that LOPE directly adjusts the focus of online exploration by considering human feedback as guidance.
LOPE outperforms several state-of-the-art methods regarding convergence rate and overall performance.
arXiv Detail & Related papers (2024-07-09T02:11:12Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
arXiv Detail & Related papers (2022-12-14T22:43:56Z)
- Imitation Learning by State-Only Distribution Matching [2.580765958706854]
Imitation Learning from observation describes policy learning in a similar way to human learning.
We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
arXiv Detail & Related papers (2022-02-09T08:38:50Z)
- Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z)
- Learning Pessimism for Robust and Efficient Off-Policy Reinforcement Learning [0.0]
Off-policy deep reinforcement learning algorithms compensate for overestimation bias during temporal-difference learning.
In this work, we propose a novel learnable penalty to enact such pessimism.
We also propose to learn the penalty alongside the critic with dual TD-learning, a strategy to estimate and minimize the bias magnitude in the target returns.
arXiv Detail & Related papers (2021-10-07T12:13:19Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with state-of-the-art methods on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
First, we show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent performance deterioration.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
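As a hedged illustration of the two mechanisms named in the DDPG++ summary above: a clipped double-Q target is one standard way to control overestimation bias, and a clipped inverse-propensity weight is one standard way to importance-sample replay transitions in the policy update. The sketch below uses these as assumptions; it is not the paper's exact recipe, and the propensity model and clipping constant are hypothetical.

```python
# Hypothetical sketch for the DDPG++ summary above (assumptions, not the paper's code):
# overestimation bias controlled with a clipped double-Q target, and a policy update
# importance-weighted by an assumed propensity score of the logged behavior.
import torch


def clipped_double_q_target(reward, not_done, next_q1, next_q2, gamma=0.99):
    """Bootstrap from the minimum of two critic estimates to limit overestimation."""
    return reward + gamma * not_done * torch.min(next_q1, next_q2)


def propensity_weighted_policy_loss(critic, policy, states, behavior_propensity,
                                    max_weight=10.0):
    """Deterministic policy-gradient objective where each replay transition is
    weighted by a clipped inverse propensity of the behavior that produced it."""
    weights = torch.clamp(1.0 / (behavior_propensity + 1e-6), max=max_weight)
    q_values = critic(states, policy(states))
    return -(weights.detach() * q_values).mean()
```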