Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL
- URL: http://arxiv.org/abs/2505.19923v1
- Date: Mon, 26 May 2025 12:45:54 GMT
- Title: Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL
- Authors: Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, Sheng-Jun Huang
- Abstract summary: We propose a selective state-adaptive regularization method for offline reinforcement learning. We show that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings.
- Score: 36.65926744075032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states. However, due to substantial variations in data quality, the fixed regularization strength often leads to a dilemma: Weak regularization strength fails to address extrapolation errors and value overestimation, while strong regularization strength shifts policy learning toward behavior cloning, impeding potential performance enabled by Bellman updates. To address this issue, we propose the selective state-adaptive regularization method for offline RL. Specifically, we introduce state-adaptive regularization coefficients to trust state-level Bellman-driven results, while selectively applying regularization on high-quality actions, aiming to avoid performance degradation caused by tight constraints on low-quality actions. By establishing a connection between the representative value regularization method, CQL, and explicit policy constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches. Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark.
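To make the idea above concrete, below is a minimal illustrative sketch (not the authors' released code) of how a state-adaptive coefficient could weight a CQL-style conservative penalty, and how a behavior-cloning term could be applied only to high-quality dataset actions. The module names (`q_net`, `value_fn`, `policy`, `alpha_net`), their interfaces, and the positive-advantage filter are assumptions made for illustration; how the coefficient network itself is trained is deliberately left out.

```python
# Illustrative sketch only (not the paper's implementation): a CQL-style
# conservative penalty whose strength is predicted per state, plus a
# behavior-cloning term applied selectively to high-advantage dataset actions.
# `q_net(s, a)`, `value_fn(s)`, `policy.sample(s)`, `policy.log_prob(s, a)`
# are hypothetical interfaces assumed for this sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateAdaptiveCoef(nn.Module):
    """Maps a state to a non-negative regularization coefficient."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),   # keep the coefficient >= 0
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def conservative_value_loss(q_net, policy, alpha_net, s, a, td_target,
                            num_samples=10):
    """TD loss plus a per-state weighted CQL-style gap penalty."""
    q_data = q_net(s, a)
    td_loss = F.mse_loss(q_data, td_target)

    # Q-values of actions sampled from the current policy (potentially OOD).
    with torch.no_grad():
        sampled = [policy.sample(s) for _ in range(num_samples)]
    q_pi = torch.stack([q_net(s, a_i) for a_i in sampled]).mean(0)

    # State-adaptive coefficient: small where Bellman-driven results are
    # trusted, large where the dataset should dominate. (How alpha_net is
    # trained is beyond this sketch.)
    alpha_s = alpha_net(s)
    gap_penalty = (alpha_s * (q_pi - q_data)).mean()
    return td_loss + gap_penalty

def selective_bc_loss(policy, q_net, value_fn, s, a):
    """Behavior-cloning regularizer applied only to dataset actions judged
    high-quality here by a positive-advantage heuristic (an assumption)."""
    with torch.no_grad():
        adv = q_net(s, a) - value_fn(s)
        keep = (adv > 0).float()
    log_prob = policy.log_prob(s, a)
    return -(keep * log_prob).sum() / keep.sum().clamp(min=1.0)
```

The point of the sketch is the shape of the losses: the penalty strength varies per state (trusting Bellman updates where the coefficient is small), and the constraint is applied selectively rather than uniformly over the dataset.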
Related papers
- EXPO: Stable Reinforcement Learning with Expressive Policies [74.30151915786233]
We propose a sample-efficient online reinforcement learning algorithm to maximize value with two parameterized policies. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods.
arXiv Detail & Related papers (2025-07-10T17:57:46Z) - Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning [22.333460316347264]
We introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies. We develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint.
arXiv Detail & Related papers (2025-02-07T09:30:35Z) - Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Model-based trajectory stitching for improved behavioural cloning and
its applications [7.462336024223669]
Trajectory Stitching (TS) generates new trajectories by 'stitching' pairs of states that were disconnected in the original data.
We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy.
arXiv Detail & Related papers (2022-12-08T14:18:04Z) - Offline Reinforcement Learning with Closed-Form Policy Improvement
Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z) - Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static, previously collected datasets.
We propose a novel approach, which we refer to as adaptive behavior regularization (ABR).
ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset.
arXiv Detail & Related papers (2022-11-15T15:59:11Z) - Regularizing a Model-based Policy Stationary Distribution to Stabilize
Offline Reinforcement Learning [62.19209005400561]
Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to learning purely from static datasets.
A key challenge of offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy.
We regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process.
arXiv Detail & Related papers (2022-06-14T20:56:16Z) - Supported Policy Optimization for Offline Reinforcement Learning [74.1011309005488]
Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization.
Regularization methods reduce the divergence between the learned policy and the behavior policy.
This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint.
arXiv Detail & Related papers (2022-02-13T07:38:36Z) - Offline Reinforcement Learning with Soft Behavior Regularization [0.8937096931077437]
In this work, we derive a new policy learning objective that can be used in the offline setting.
Unlike the state-independent regularization used in prior approaches, this soft regularization allows more freedom of policy deviation.
Our experimental results show that SBAC matches or outperforms the state-of-the-art on a set of continuous control locomotion and manipulation tasks.
arXiv Detail & Related papers (2021-10-14T14:29:44Z) - BRAC+: Improved Behavior Regularized Actor Critic for Offline
Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimating the values of out-of-distribution (less explored) actions.
We improve the behavior regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
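As a small companion to the GPMD entry above, the toy NumPy sketch below runs policy mirror descent on a randomly generated tabular MDP under entropy regularization, where each step admits the well-known closed-form multiplicative update. This is only the entropy-regularized special case with a fixed stepsize; GPMD itself covers general convex regularizers, which this sketch does not attempt.

```python
# Toy illustration, not the GPMD algorithm itself: policy mirror descent on a
# random tabular MDP with entropy regularization (coefficient tau), for which
# one PMD step has the closed form
#   pi_new(a|s)  proportional to  pi(a|s)^(1/(1+eta*tau)) * exp(eta*Q(s,a)/(1+eta*tau)).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, tau, eta = 4, 3, 0.9, 0.1, 0.5   # arbitrary sizes and constants

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(nS, nA))        # random reward table
pi = np.full((nS, nA), 1.0 / nA)                # start from the uniform policy

def evaluate(pi):
    """Exact entropy-regularized policy evaluation via a linear solve."""
    r_pi = (pi * (R - tau * np.log(pi))).sum(axis=1)   # per-state regularized reward
    P_pi = np.einsum("sap,sa->sp", P, pi)              # state-to-state kernel under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V                              # regularized Q, shape (nS, nA)
    return V, Q

def pmd_step(pi, Q):
    """Closed-form mirror descent update for the entropy regularizer."""
    logits = (np.log(pi) + eta * Q) / (1.0 + eta * tau)
    new_pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    return new_pi / new_pi.sum(axis=1, keepdims=True)

for _ in range(50):
    _, Q = evaluate(pi)
    pi = pmd_step(pi, Q)

V, _ = evaluate(pi)
print("regularized state values after 50 PMD steps:", np.round(V, 3))
```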