Policy Constraint by Only Support Constraint for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2503.05207v1
- Date: Fri, 07 Mar 2025 07:55:51 GMT
- Title: Policy Constraint by Only Support Constraint for Offline Reinforcement Learning
- Authors: Yunkai Gao, Jiaming Guo, Fan Wu, Rui Zhang,
- Abstract summary: We present Only Support Constraint (OSC), which is derived from maximizing the total probability of the learned policy in the support of the behavior policy. OSC significantly enhances performance, alleviating the challenges associated with distributional shifts and mitigating the conservatism of policy constraints.
- Score: 11.006709826558465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning (RL) aims to optimize a policy using pre-collected datasets so as to maximize cumulative rewards. However, offline RL suffers from the distributional shift between the learned policy and the behavior policy, which leads to errors when computing Q-values for out-of-distribution (OOD) actions. To mitigate this issue, policy constraint methods either align the learned policy's distribution with that of the behavior policy or confine action selection to the support of the behavior policy. However, current policy constraint methods tend to be excessively conservative, preventing the policy from surpassing the behavior policy's performance. In this work, we present Only Support Constraint (OSC), which is derived from maximizing the total probability of the learned policy within the support of the behavior policy, to address the conservatism of policy constraints. OSC introduces a regularization term that only restricts the policy to the support, without imposing extra constraints on actions within the support. Additionally, to fully harness the performance of the new policy constraint, OSC utilizes a diffusion model to effectively characterize the support of the behavior policy. Experimental evaluations across a variety of offline RL benchmarks demonstrate that OSC significantly enhances performance, alleviating the challenges associated with distributional shift and mitigating the conservatism of policy constraints. Code is available at https://github.com/MoreanP/OSC.
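To make the idea concrete, below is a minimal, hypothetical sketch of a support-only regularizer of the kind the abstract describes, contrasted with a conventional distribution-matching penalty. The helper `behavior_log_density`, the threshold `eps`, and the weight `alpha` are illustrative placeholders (the paper characterizes the support with a diffusion model); this is not the authors' implementation.

```python
# Illustrative sketch only: a support-only penalty vs. a distribution-matching
# penalty. `behavior_log_density(s, a)` stands in for any estimate of the
# behavior policy's log-density -- the paper obtains it from a diffusion model;
# here it is just an assumed callable.
import torch


def distribution_penalty(policy, states, dataset_actions):
    """Conventional behavior-cloning-style constraint: pulls the policy toward
    the dataset actions themselves, which is what makes such methods conservative."""
    return ((policy(states) - dataset_actions) ** 2).mean()


def support_only_penalty(policy, states, behavior_log_density, eps=-5.0):
    """Support-only constraint in the spirit of OSC: actions whose estimated
    behavior log-density exceeds the threshold `eps` (i.e., lie inside the
    support) incur no penalty; only out-of-support actions are pushed back."""
    actions = policy(states)
    log_p = behavior_log_density(states, actions)   # shape: [batch]
    return torch.relu(eps - log_p).mean()           # zero inside the support


def actor_loss(q_net, policy, states, behavior_log_density, alpha=1.0):
    """Q-maximizing actor loss with the support-only regularizer added."""
    actions = policy(states)                        # reparameterized actions
    q_value = q_net(states, actions).mean()
    penalty = support_only_penalty(policy, states, behavior_log_density)
    return -q_value + alpha * penalty
```

The contrast is the point: the first penalty keeps pulling the policy toward whatever actions the dataset happens to contain, while the second leaves the policy free to pick any action the behavior policy could plausibly have taken.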
Related papers
- SPoRt -- Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL [54.022106606140774]
We present theoretical results that provide a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setup.
We also present SPoRt, which enables the user to trade off safety guarantees in exchange for task-specific performance.
arXiv Detail & Related papers (2025-04-08T19:09:07Z)
- Constraint-Adaptive Policy Switching for Offline Safe Reinforcement Learning [37.660801621012745]
Offline safe reinforcement learning (OSRL) involves learning a decision-making policy to maximize rewards from a fixed batch of training data.
We introduce constraint-adaptive policy switching (CAPS), a wrapper framework around existing offline RL algorithms.
CAPS consistently outperforms existing methods, establishing a strong wrapper-based baseline for OSRL.
arXiv Detail & Related papers (2024-12-25T16:42:27Z)
- SelfBC: Self Behavior Cloning for Offline Reinforcement Learning [14.573290839055316]
We propose a novel dynamic policy constraint that restricts the learned policy to samples generated by the exponential moving average of previously learned policies (see the sketch after this list).
Our approach results in a nearly monotonically improved reference policy.
arXiv Detail & Related papers (2024-08-04T23:23:48Z)
- Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, assuming no approximation or sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
- Policy Regularization with Dataset Constraint for Offline Reinforcement Learning [27.868687398300658]
We consider the problem of learning the best possible policy from a fixed dataset, known as offline reinforcement learning (RL).
In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective, and thus propose Policy Regularization with Dataset Constraint (PRDC).
PRDC can guide the policy with proper behaviors from the dataset, allowing it to choose actions that do not appear in the dataset for the given state (a brief sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-06-11T03:02:10Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing a lower bound on this mutual information is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce three different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
arXiv Detail & Related papers (2022-02-13T07:38:36Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
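As referenced in the SelfBC entry above, here is a rough, hypothetical sketch of the self-constraint idea: keep an exponential moving average (EMA) copy of the policy and regularize toward its actions instead of the raw dataset actions. The EMA rate `tau` and the squared-error form are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of an EMA-reference constraint in the spirit of SelfBC.
import copy
import torch


def make_ema_policy(policy):
    """Frozen copy of the policy that will track an exponential moving average."""
    ema = copy.deepcopy(policy)
    for p in ema.parameters():
        p.requires_grad_(False)
    return ema


@torch.no_grad()
def update_ema(ema_policy, policy, tau=0.005):
    """Blend current policy weights into the EMA copy after each update step."""
    for p_ema, p in zip(ema_policy.parameters(), policy.parameters()):
        p_ema.mul_(1.0 - tau).add_(p, alpha=tau)


def self_bc_penalty(policy, ema_policy, states):
    """Regularize toward the EMA policy's actions instead of dataset actions."""
    with torch.no_grad():
        reference_actions = ema_policy(states)
    return ((policy(states) - reference_actions) ** 2).mean()
```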
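Likewise, for the PRDC entry above, here is a minimal, hypothetical sketch of regularizing the policy toward the nearest state-action pair in the dataset. The brute-force neighbor search and the state weight `beta` are illustrative simplifications, not the authors' implementation.

```python
# Illustrative sketch of a nearest-pair dataset constraint in the spirit of PRDC.
import torch


def nearest_dataset_action(states, actions, ds_states, ds_actions, beta=2.0):
    """Brute-force nearest-neighbor search over dataset (state, action) pairs;
    a practical implementation would use an efficient search structure."""
    query = torch.cat([beta * states, actions], dim=-1)       # [B, ds+da]
    keys = torch.cat([beta * ds_states, ds_actions], dim=-1)  # [N, ds+da]
    idx = torch.cdist(query, keys).argmin(dim=-1)             # nearest index per query
    return ds_actions[idx]                                    # [B, da]


def prdc_style_penalty(policy, states, ds_states, ds_actions):
    """Pull the policy's action toward the action of the nearest dataset pair."""
    actions = policy(states)
    target = nearest_dataset_action(states, actions.detach(), ds_states, ds_actions)
    return ((actions - target) ** 2).mean()
```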
This list is automatically generated from the titles and abstracts of the papers on this site.