Less is More: Clustered Cross-Covariance Control for Offline RL
- URL: http://arxiv.org/abs/2601.20765v2
- Date: Sat, 31 Jan 2026 08:04:07 GMT
- Title: Less is More: Clustered Cross-Covariance Control for Offline RL
- Authors: Nan Qiao, Sheng Yue, Shuning Wang, Yongheng Deng, Ju Ren,
- Abstract summary: A fundamental challenge in offline reinforcement learning is distributional shift. We propose partitioned buffer sampling that restricts updates to localized replay partitions. We also introduce an explicit gradient-based corrective penalty that cancels the covariance-induced bias within each update.
- Score: 13.198112768636207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A fundamental challenge in offline reinforcement learning is distributional shift. Scarce data, or datasets dominated by out-of-distribution (OOD) areas, exacerbates this issue. Our theoretical analysis and experiments show that the standard squared-error objective induces a harmful TD cross-covariance; this effect is amplified in OOD areas, biasing optimization and degrading policy learning. To counteract this mechanism, we develop two complementary strategies. The first, Clustered Cross-Covariance Control for TD (C^4), is a partitioned buffer sampling scheme that restricts updates to localized replay partitions, attenuating irregular covariance effects and aligning update directions; it is easy to integrate with existing implementations. The second is an explicit gradient-based corrective penalty that cancels the covariance-induced bias within each update. We prove that buffer partitioning preserves the lower-bound property of the maximization objective, and that these constraints mitigate excessive conservatism in extreme OOD areas without altering the core behavior of policy-constrained offline reinforcement learning. Empirically, our method shows higher stability and up to 30% improvement in returns over prior methods, especially with small datasets and splits that emphasize OOD areas.
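The abstract outlines the mechanism but shows no code. Below is a minimal Python sketch of what partitioned (clustered) buffer sampling in the spirit of C^4 could look like, assuming transitions are partitioned by k-means over states and each TD update draws its whole minibatch from a single partition; the class name, cluster count, and uniform within-partition sampling are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: partition the replay data by state clusters and
# sample each minibatch from one partition, so TD targets within a batch stay
# locally correlated instead of mixing across the whole dataset.
import numpy as np
from sklearn.cluster import KMeans

class ClusteredReplayBuffer:
    def __init__(self, states, actions, rewards, next_states, dones, n_clusters=16):
        self.data = (states, actions, rewards, next_states, dones)
        # Assumption: k-means on raw states defines the replay partitions.
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(states)
        self.partitions = [np.where(labels == k)[0] for k in range(n_clusters)]

    def sample(self, batch_size, rng=np.random):
        # Pick one partition, then sample the minibatch entirely inside it.
        part = self.partitions[rng.randint(len(self.partitions))]
        idx = rng.choice(part, size=min(batch_size, len(part)), replace=False)
        return tuple(x[idx] for x in self.data)
```

The gradient-based corrective penalty would then be added to the TD loss computed on each such batch; its exact form is given in the paper and is not reproduced here.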
Related papers
- Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning [22.17044827069627]
We propose a plug-and-play training-sample replacer that substitutes low-value dataset actions with high-value actions generated by a stable actor. Experiments show that PAR consistently improves performance and approaches state-of-the-art results when combined with the basic TD3+BC.
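The abstract does not give the replacement rule, so the following Python sketch is a guess at one plausible form: compare the critic's value of the dataset action with that of an action proposed by a stable (e.g., EMA) actor, and swap in the proposal when it wins by a margin. The margin and the EMA actor are assumptions.

```python
# Hypothetical PAR-style sample replacer; none of these names or details
# come from the paper.
import torch

@torch.no_grad()
def replace_low_value_actions(states, actions, critic, stable_actor, margin=0.0):
    proposed = stable_actor(states)                   # actions from a stable actor
    q_data = critic(states, actions).reshape(-1)      # value of dataset actions
    q_prop = critic(states, proposed).reshape(-1)     # value of proposed actions
    swap = (q_prop > q_data + margin).float().reshape(-1, 1)
    return swap * proposed + (1.0 - swap) * actions   # per-sample replacement
```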
arXiv Detail & Related papers (2026-02-07T08:44:27Z) - Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited to the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z) - Geometric-Disentangelment Unlearning [106.99160454669902]
Gradient ascent on forget samples often harms retained knowledge. We propose Geometric-Disentanglement Unlearning (GU), which decomposes any candidate forget-gradient update into components tangential and normal to the retain space and executes only the normal component. Our method is plug-and-play and can be attached to existing gradient-based unlearning procedures to mitigate side effects.
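As a rough illustration of the decomposition, here is a Python sketch that approximates the retain space by a single retain-batch gradient; reducing the space to one direction is our simplification, not necessarily GU's construction.

```python
# Sketch: split the forget gradient g_f into its component along the retain
# gradient g_r (tangential) and the remainder (normal), and apply only the
# normal part, which by construction has zero inner product with g_r.
import torch

def project_out_retain_direction(g_f: torch.Tensor, g_r: torch.Tensor) -> torch.Tensor:
    tangential = (g_f @ g_r) / (g_r @ g_r).clamp_min(1e-12) * g_r
    return g_f - tangential  # the normal component, inert along the retain direction
```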
arXiv Detail & Related papers (2025-11-21T09:58:25Z) - Partial Action Replacement: Tackling Distribution Shift in Offline MARL [11.861550409939818]
Offline multi-agent reinforcement learning (MARL) is severely hampered by the challenge of evaluating out-of-distribution (OOD) joint actions. We develop Soft-Partial Conservative Q-Learning (SPaCQL), which mitigates the OOD issue by dynamically weighting different PAR strategies. Our theoretical results also indicate that SPaCQL adaptively addresses distribution shift using uncertainty-informed weights.
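The abstract does not describe the weighting mechanism; one speculative Python sketch, assuming an ensemble critic whose disagreement serves as the uncertainty signal and a softmax that favors low-uncertainty PAR strategies:

```python
# Speculative sketch of uncertainty-informed weights over PAR strategies.
# The ensemble-std uncertainty and the softmax rule are our assumptions.
import torch

def weight_par_strategies(q_ensemble, states, actions_per_strategy, tau=1.0):
    disagreement = []
    for a in actions_per_strategy:                    # one joint action per strategy
        qs = torch.stack([q(states, a) for q in q_ensemble])  # (ensemble, batch, ...)
        disagreement.append(qs.std(dim=0).mean())     # scalar uncertainty per strategy
    u = torch.stack(disagreement)
    return torch.softmax(-u / tau, dim=0)             # lower uncertainty, higher weight
```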
arXiv Detail & Related papers (2025-11-10T20:56:58Z) - Learning from Sparse Offline Datasets via Conservative Density Estimation [27.93418377019955]
We propose a novel training algorithm called Conservative Density Estimation (CDE).
CDE addresses the challenge by explicitly imposing constraints on the stationary state-action occupancy distribution.
Our method achieves state-of-the-art performance on the D4RL benchmark.
arXiv Detail & Related papers (2024-01-16T20:42:15Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
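The double-sampling issue and its Fenchel-dual fix can be stated compactly. The identity below is standard convex duality written in our own notation; it is not an excerpt from the paper.

```latex
% Variance of a random return X contains a squared expectation:
\[
  \operatorname{Var}(X) \;=\; \mathbb{E}[X^2] \;-\; \bigl(\mathbb{E}[X]\bigr)^2 .
\]
% The gradient of (E[X])^2 needs two independent samples of X. Fenchel
% duality for the square function removes the problem:
\[
  \bigl(\mathbb{E}[X]\bigr)^2 \;=\; \max_{\nu \in \mathbb{R}}
  \bigl( 2\,\nu\,\mathbb{E}[X] \;-\; \nu^2 \bigr),
  \qquad \nu^{*} = \mathbb{E}[X].
\]
% The expectation now enters linearly, so a single sample of X gives an
% unbiased stochastic gradient of the inner objective.
```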
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static, previously collected datasets.
We propose a novel approach, which we refer to as adaptive behavior regularization (ABR).
ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset.
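The abstract does not say how ABR sets the trade-off, so the gate in the Python sketch below (cloning harder where the dataset action beats the policy action under the current critic) is a made-up placeholder; only the interpolation between cloning and improving is taken from the abstract.

```python
# Loose sketch of an actor loss interpolating behavior cloning and
# Q-improvement; the advantage-based weight w is purely illustrative.
import torch

def abr_style_actor_loss(pi, critic, states, actions):
    pred = pi(states)
    bc = ((pred - actions) ** 2).sum(-1)              # clone the dataset action
    improve = -critic(states, pred).reshape(-1)       # push toward higher Q
    with torch.no_grad():
        adv = (critic(states, actions) - critic(states, pred)).reshape(-1)
        w = torch.sigmoid(adv)                        # placeholder adaptive gate
    return (w * bc + (1.0 - w) * improve).mean()
```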
arXiv Detail & Related papers (2022-11-15T15:59:11Z) - Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
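A Python sketch of return-based resampling in the spirit of ReD, assuming transitions are drawn with probability proportional to a softmax over their episode's return; the softmax weighting and temperature are our assumptions, not necessarily how ReD rebalances.

```python
# Sketch: resample transitions by episode return. Every transition keeps a
# nonzero probability, so the support of the dataset is unchanged.
import numpy as np

def return_weighted_indices(episode_returns, episode_of_transition, size,
                            temperature=1.0, rng=np.random):
    r = np.asarray(episode_returns, dtype=np.float64)
    w = np.exp((r - r.max()) / temperature)           # numerically stable softmax
    p = w[np.asarray(episode_of_transition)]          # weight transitions by episode
    p /= p.sum()
    return rng.choice(len(p), size=size, p=p)
```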
arXiv Detail & Related papers (2022-10-17T16:34:01Z) - The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift [127.21287240963859]
We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data.
For a large class of linear regression instances, transfer learning with $O(N^2)$ source data is as effective as supervised learning with $N$ target data.
arXiv Detail & Related papers (2022-08-03T05:59:49Z) - Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z) - False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z) - BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve the behavior regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)