Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning
- URL: http://arxiv.org/abs/2306.05726v2
- Date: Tue, 17 Oct 2023 16:25:25 GMT
- Title: Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning
- Authors: Xiaohan Hu, Yi Ma, Chenjun Xiao, Yan Zheng, Jianye Hao
- Abstract summary: In this paper, we propose a new algorithm that substantially enhances behavior regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, conservative policy updates guarantee gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
- Score: 57.10922880400715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the fundamental challenges for offline reinforcement learning (RL) is
ensuring robustness to the data distribution. Whether the data originates from a
near-optimal policy or not, we anticipate that an algorithm should demonstrate
its ability to learn an effective control policy that seamlessly aligns with
the inherent distribution of offline data. Unfortunately, behavior
regularization, a simple yet effective offline RL algorithm, tends to struggle
in this regard. In this paper, we propose a new algorithm that substantially
enhances behavior regularization based on conservative policy iteration. Our
key observation is that by iteratively refining the reference policy used for
behavior regularization, conservative policy updates guarantee gradual
improvement, while also implicitly avoiding querying out-of-sample actions to
prevent catastrophic learning failures. We prove that in the tabular setting
this algorithm is capable of learning the optimal policy covered by the offline
dataset, commonly referred to as the in-sample optimal policy. We then explore
several implementation details of the algorithm when function approximations
are applied. The resulting algorithm is easy to implement, requiring only a few
lines of code modification to existing methods. Experimental results on the
D4RL benchmark indicate that our method outperforms previous state-of-the-art
baselines in most tasks, clearly demonstrating its superiority over behavior
regularization.
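To make the idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of a behavior-regularized actor update in which the reference policy used in the regularization term is periodically refreshed to a copy of the current policy, rather than kept fixed at the behavior policy. The network architecture, the penalty weight `alpha`, the refresh interval `refresh_every`, and the critic interface are all illustrative assumptions.

```python
# Illustrative sketch of iteratively refined behavior regularization.
# Assumptions (not from the paper): diagonal Gaussian policy, a critic
# callable as critic(obs, act) -> (B, 1), alpha, and refresh_every.
import copy
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy (illustrative architecture)."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * act_dim),
        )

    def dist(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return Normal(mean, log_std.clamp(-5.0, 2.0).exp())


def actor_update(policy, ref_policy, critic, batch, optimizer, alpha=1.0):
    """One behavior-regularized improvement step: maximize
    Q(s, a) - alpha * KL(pi || pi_ref), where pi_ref is the iteratively
    refined reference policy rather than a fixed behavior policy."""
    obs = batch["obs"]
    pi = policy.dist(obs)
    actions = pi.rsample()                 # reparameterized actions
    q_value = critic(obs, actions)         # assumed shape (B, 1)
    with torch.no_grad():
        ref = ref_policy.dist(obs)
    # Monte Carlo estimate of KL(pi || pi_ref) at the sampled actions.
    kl = (pi.log_prob(actions) - ref.log_prob(actions)).sum(-1, keepdim=True)
    loss = (-q_value + alpha * kl).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def train_policy(policy, critic, loader, num_steps, alpha=1.0, refresh_every=5000):
    # The reference policy starts as a frozen copy of the initial policy
    # (e.g., one pre-trained with behavior cloning on the offline dataset).
    ref_policy = copy.deepcopy(policy)
    for p in ref_policy.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
    for step, batch in zip(range(num_steps), loader):
        actor_update(policy, ref_policy, critic, batch, optimizer, alpha)
        # Iterative refinement: move the regularization target to the latest
        # policy so later updates are constrained to an improved reference.
        if (step + 1) % refresh_every == 0:
            ref_policy.load_state_dict(policy.state_dict())
    return policy
```

Under this scheme each update is a conservative improvement step against the latest reference, which is what allows the regularized policy to gradually move beyond the raw behavior policy while staying within the support of the offline data.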
Related papers
- CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning [25.071018803326254]
Distribution shift is a major obstacle in offline reinforcement learning.
Previous conservative offline RL algorithms struggle to generalize to unseen actions.
We propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions.
arXiv Detail & Related papers (2024-06-11T17:59:29Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Offline Reinforcement Learning with Soft Behavior Regularization [0.8937096931077437]
In this work, we derive a new policy learning objective that can be used in the offline setting.
Unlike the state-independent regularization used in prior approaches, this soft regularization allows more freedom of policy deviation.
Our experimental results show that SBAC matches or outperforms the state-of-the-art on a set of continuous control locomotion and manipulation tasks.
arXiv Detail & Related papers (2021-10-14T14:29:44Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.