Policy Regularization with Dataset Constraint for Offline Reinforcement
Learning
- URL: http://arxiv.org/abs/2306.06569v2
- Date: Tue, 15 Aug 2023 16:14:42 GMT
- Title: Policy Regularization with Dataset Constraint for Offline Reinforcement
Learning
- Authors: Yuhang Ran, Yi-Chen Li, Fuxiang Zhang, Zongzhang Zhang, Yang Yu
- Abstract summary: We consider the problem of learning the best possible policy from a fixed dataset, known as offline Reinforcement Learning (RL).
In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective and thus propose Policy Regularization with Dataset Constraint (PRDC).
PRDC can guide the policy with proper behaviors from the dataset, allowing it to choose actions that never appear in the dataset together with the given state.
- Score: 27.868687398300658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of learning the best possible policy from a fixed
dataset, known as offline Reinforcement Learning (RL). A common class of
existing offline RL methods is policy regularization, which typically
constrains the learned policy to the distribution or support of the behavior
policy. However, distribution and support constraints are overly conservative,
since both force the policy to choose actions similar to the behavior
policy's in a given state. This limits the learned policy's performance,
especially when the behavior policy is sub-optimal. In this paper, we find that
regularizing the policy towards the nearest state-action pair can be more
effective and thus propose Policy Regularization with Dataset Constraint
(PRDC). When updating the policy in a given state, PRDC searches the entire
dataset for the nearest state-action sample and then restricts the policy with
the action of this sample. Unlike previous works, PRDC can guide the policy
with proper behaviors from the dataset, allowing it to choose actions that
never appear in the dataset together with the given state. This is a softer
constraint, yet it retains enough conservatism against out-of-distribution
actions. Empirical
evidence and theoretical analysis show that PRDC can alleviate offline RL's
fundamentally challenging value overestimation issue with a bounded performance
gap. Moreover, on a set of locomotion and navigation tasks, PRDC achieves
state-of-the-art performance compared with existing methods. Code is available
at https://github.com/LAMDA-RL/PRDC.
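The update described in the abstract (search the entire dataset for the nearest state-action sample, then restrict the policy with that sample's action) can be made concrete with a small sketch. The version below is illustrative only: the KD-tree query over concatenated (beta * state, action) vectors, the names beta and lambda_reg, and the squared-error penalty are assumptions for the sake of a runnable example, not the authors' exact implementation, which lives in the linked repository.

```python
# Hypothetical sketch of PRDC-style dataset-constraint regularization.
# Assumed details: KD-tree over concatenated (beta * state, action) vectors,
# a squared-error penalty, and the names beta / lambda_reg.
import numpy as np
import torch
from scipy.spatial import KDTree


class DatasetConstraint:
    def __init__(self, states: np.ndarray, actions: np.ndarray, beta: float = 2.0):
        # Index every (state, action) pair in the offline dataset so the
        # nearest sample to a query can be found without a linear scan.
        self.beta = beta
        self.action_dim = actions.shape[1]
        self.tree = KDTree(np.concatenate([beta * states, actions], axis=1))

    def nearest_actions(self, states: np.ndarray, policy_actions: np.ndarray) -> np.ndarray:
        # Query with (beta * s, pi(s)) and return the action component of the
        # nearest dataset sample as the regularization target.
        query = np.concatenate([self.beta * states, policy_actions], axis=1)
        _, idx = self.tree.query(query)
        return np.asarray(self.tree.data)[idx, -self.action_dim:]


def actor_loss(critic, policy, states: torch.Tensor,
               constraint: DatasetConstraint, lambda_reg: float = 1.0) -> torch.Tensor:
    # Q-maximizing actor objective plus a penalty pulling pi(s) toward the
    # action of the nearest dataset sample -- a softer constraint than
    # matching the behavior policy at the same state.
    pi_a = policy(states)
    target = constraint.nearest_actions(states.detach().cpu().numpy(),
                                        pi_a.detach().cpu().numpy())
    target = torch.as_tensor(target, dtype=pi_a.dtype, device=pi_a.device)
    q = critic(states, pi_a).squeeze(-1)
    reg = (pi_a - target).pow(2).sum(dim=1)
    return (-q + lambda_reg * reg).mean()
```

In this sketch, weighting the state coordinates by beta controls the trade-off in the nearest-neighbor search: a larger beta favors samples whose states resemble the query state, while a smaller beta favors samples whose actions resemble the current policy output.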
Related papers
- Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning [12.112619241073158]
In offline reinforcement learning, the challenge of out-of-distribution actions is pronounced.
Existing methods often constrain the learned policy through policy regularization.
We propose Adaptive Advantage-guided Policy Regularization (A2PR)
arXiv Detail & Related papers (2024-05-30T10:20:55Z)
- A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective [29.977702744504466]
We introduce a novel Advantage-Aware Policy Optimization (A2PO) method to explicitly construct advantage-aware policy constraints for offline learning.
A2PO employs a conditional variational auto-encoder to disentangle the action distributions of intertwined behavior policies.
Experiments conducted on both the single-quality and mixed-quality datasets of the D4RL benchmark demonstrate that A2PO yields results superior to its counterparts.
arXiv Detail & Related papers (2024-03-12T02:43:41Z)
- Offline Imitation Learning with Suboptimal Demonstrations via Relaxed Distribution Matching [109.5084863685397]
Offline imitation learning (IL) promises the ability to learn performant policies from pre-collected demonstrations without interactions with the environment.
We present RelaxDICE, which employs an asymmetrically-relaxed f-divergence for explicit support regularization.
Our method significantly outperforms the best prior offline method in six standard continuous control environments.
arXiv Detail & Related papers (2023-03-05T03:35:11Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline Reinforcement Learning.
In this paper, we propose closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing a lower bound on this mutual information is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce three different variants of MISA and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint (see the sketch after this list).
arXiv Detail & Related papers (2022-02-13T07:38:36Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
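As referenced in the SPOT entry above, a density-based support constraint penalizes only actions that leave the estimated support of the behavior policy, rather than matching its full distribution. The sketch below illustrates that general idea under stated assumptions: the pre-trained log_density_model approximating log p_beta(a | s), the threshold eps, and the weight lam are all hypothetical names, and this is not SPOT's (or PRDC's) actual construction.

```python
# Hypothetical illustration of a density-based support penalty.
# Assumed pieces: a pre-trained log_density_model approximating log p_beta(a | s),
# the threshold eps, and the weight lam; none of these come from the papers above.
import torch


def support_penalty(log_density_model, states: torch.Tensor,
                    actions: torch.Tensor, eps: float = -3.0) -> torch.Tensor:
    # Only actions whose estimated behavior log-density falls below eps are
    # penalized; in-support actions incur no cost, which is what distinguishes
    # a support constraint from a distribution-matching penalty.
    log_p = log_density_model(states, actions)
    return torch.clamp(eps - log_p, min=0.0).mean()


def supported_actor_loss(critic, policy, log_density_model,
                         states: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    # Maximize Q while staying on the estimated support of the behavior policy.
    actions = policy(states)
    q = critic(states, actions).squeeze(-1)
    return (-q).mean() + lam * support_penalty(log_density_model, states, actions)
```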