Offline Reinforcement Learning with Soft Behavior Regularization
- URL: http://arxiv.org/abs/2110.07395v1
- Date: Thu, 14 Oct 2021 14:29:44 GMT
- Title: Offline Reinforcement Learning with Soft Behavior Regularization
- Authors: Haoran Xu, Xianyuan Zhan, Jianxiong Li, Honglei Yin
- Abstract summary: In this work, we derive a new policy learning objective that can be used in the offline setting.
Unlike state-independent regularization used in prior approaches, this \textit{soft} regularization allows more freedom of policy deviation.
Our experimental results show that SBAC matches or outperforms the state-of-the-art on a set of continuous control locomotion and manipulation tasks.
- Score: 0.8937096931077437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most prior approaches to offline reinforcement learning (RL) utilize \textit{behavior regularization}, typically augmenting existing off-policy actor-critic algorithms with a penalty measuring divergence between the policy and the offline data. However, these approaches lack guaranteed performance improvement over the behavior policy. In this work, starting from the performance difference between the learned policy and the behavior policy, we derive a new policy learning objective that can be used in the offline setting, which corresponds to the advantage function of the behavior policy multiplied by a state-marginal density ratio. We propose a practical way to compute the density ratio and demonstrate its equivalence to a state-dependent behavior regularization. Unlike the state-independent regularization used in prior approaches, this \textit{soft} regularization allows more freedom of policy deviation at high-confidence states, leading to better performance and stability. We thus term our resulting algorithm Soft Behavior-regularized Actor Critic (SBAC). Our experimental results show that SBAC matches or outperforms the state-of-the-art on a set of continuous control locomotion and manipulation tasks.
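As a sketch of the objective described in the abstract (a reconstruction from the abstract's wording; the notation is mine, not the paper's), the performance difference between the learned policy and the behavior policy can be written in terms of the behavior policy's advantage weighted by a state-marginal density ratio:

```latex
% Sketch reconstructed from the abstract (notation is assumed, not taken from the paper).
% J(.) = discounted return, d^pi / d^beta = state-marginal distributions,
% A^beta = advantage function of the behavior policy beta.
\[
J(\pi) - J(\beta)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi(\cdot \mid s)}\big[A^{\beta}(s,a)\big]
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d^{\beta},\, a \sim \pi(\cdot \mid s)}
    \Big[\tfrac{d^{\pi}(s)}{d^{\beta}(s)}\, A^{\beta}(s,a)\Big].
\]
```

The second form can be estimated from offline data alone (states drawn from d^beta), with the density ratio acting as a state-dependent weight; per the abstract, this is what makes the resulting behavior regularization "soft", i.e. looser at high-confidence states.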
Related papers
- CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning [25.071018803326254]
Distribution shift is a major obstacle in offline reinforcement learning.
Previous conservative offline RL algorithms struggle to generalize to unseen actions.
We propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions.
arXiv Detail & Related papers (2024-06-11T17:59:29Z)
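The CDSA entry above describes adjusting dataset actions along the gradient field of the dataset density. A minimal, hypothetical sketch of that idea (a generic score-based adjustment, not CDSA's actual procedure), with `score_net` assumed to approximate the action-density gradient:

```python
import torch

def adjust_action(score_net, state, action, step_size=0.1, n_steps=5):
    """Nudge a proposed action toward higher dataset density by following a
    learned gradient field (score) of the action distribution in the data.
    `score_net(state, action) ~ grad_a log p_D(a | s)` is an assumed component;
    this is an illustrative sketch only."""
    a = action.clone()
    for _ in range(n_steps):
        with torch.no_grad():
            a = a + step_size * score_net(state, a)   # move along the density gradient
    return a.clamp(-1.0, 1.0)                         # keep actions in the valid range
```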
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
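A hypothetical sketch of the iterated scheme described in the entry above (my reading of the summary, not the paper's algorithm): each round solves a behavior-regularized objective against the previous policy rather than a fixed behavior policy. The interfaces (`policy(s)` returning a `torch.distributions` object, `dataset.sample_states()`) are assumptions for illustration.

```python
import copy
import torch

def iteratively_refined_actor(policy, q_net, dataset, alpha=1.0,
                              n_outer=5, n_inner=1000, lr=3e-4):
    """Sketch: conservative, behavior-regularized policy updates in which the
    reference policy is periodically replaced by the current policy."""
    reference = copy.deepcopy(policy).requires_grad_(False)  # initial reference ~ behavior policy
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(n_outer):
        for _ in range(n_inner):
            s = dataset.sample_states()                      # states from the offline data
            dist = policy(s)
            a = dist.rsample()                               # reparameterized action sample
            q = q_net(s, a).squeeze(-1)
            kl = torch.distributions.kl_divergence(dist, reference(s))
            loss = -q.mean() + alpha * kl.mean()             # maximize Q, stay near reference
            opt.zero_grad()
            loss.backward()
            opt.step()
        reference = copy.deepcopy(policy).requires_grad_(False)  # refine the reference policy
    return policy
```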
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
arXiv Detail & Related papers (2022-02-13T07:38:36Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
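The Implicit Q-Learning entry above never evaluates actions outside the dataset; the well-known mechanism behind this (sketched here from general knowledge of that paper, not from the summary text) is to fit a state-value function with an asymmetric expectile loss on dataset actions only:

```python
import torch

def expectile_value_loss(q_values, v_values, tau=0.7):
    """In-sample value learning sketch: fit V(s) toward an upper expectile of
    Q(s, a) over *dataset* actions, so no out-of-dataset action is queried.
    `q_values` = Q(s, a) on dataset pairs, `v_values` = V(s); tau is assumed."""
    diff = q_values.detach() - v_values               # u = Q(s, a) - V(s)
    weight = torch.abs(tau - (diff < 0).float())      # |tau - 1(u < 0)|
    return (weight * diff.pow(2)).mean()              # asymmetric squared loss
```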
- Offline Reinforcement Learning with Fisher Divergence Critic Regularization [41.085156836450466]
We propose an alternative approach to encouraging the learned policy to stay close to the data, namely parameterizing the critic as the log-behavior-policy plus an offset term.
Behavior regularization then corresponds to an appropriate regularizer on the offset term.
Our algorithm Fisher-BRC achieves both improved performance and faster convergence over existing state-of-the-art methods.
arXiv Detail & Related papers (2021-03-14T22:11:40Z)
- Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of \textit{amortized} optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z)
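To make the amortization distinction in the last entry concrete, here is an illustrative, hypothetical stand-in (not the paper's method, which learns the iterative update with an optimizer network): direct amortization maps a state to distribution parameters in one forward pass, while the iterative variant refines those parameters with a few inner optimization steps on the policy objective. `q_net(state, action)` is assumed given.

```python
import torch

def iterative_policy_params(q_net, state, act_dim, n_steps=5, lr=0.1):
    """Sketch: refine Gaussian policy parameters (mu, log_std) per state by
    plain gradient ascent on the expected Q-value, instead of predicting them
    in a single amortized forward pass."""
    mu = torch.zeros(state.shape[0], act_dim, requires_grad=True)
    log_std = torch.zeros(state.shape[0], act_dim, requires_grad=True)
    opt = torch.optim.SGD([mu, log_std], lr=lr)
    for _ in range(n_steps):
        dist = torch.distributions.Normal(mu, log_std.exp())
        action = dist.rsample()                       # reparameterized sample
        loss = -q_net(state, action).mean()           # ascend the expected value
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mu.detach(), log_std.detach()
```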
This list is automatically generated from the titles and abstracts of the papers on this site.