Related papers: Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning

URL: http://arxiv.org/abs/2512.10510v1
Date: Thu, 11 Dec 2025 10:30:04 GMT
Title: Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning
Authors: Chihyeon Song, Jaewoo Lee, Jinkyoo Park
Abstract summary: We introduce the Adaptive Replay Buffer (ARB), a novel approach that prioritizes data sampling based on a lightweight metric we call 'on-policyness'. ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing Offline-to-Online Reinforcement Learning algorithms. Our experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms.
Score: 29.513882808306406
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call 'on-policyness'. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy's behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively leverages offline data for initial stability while progressively focusing learning on the most relevant, high-rewarding online experiences. Our extensive experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms, highlighting the importance of an adaptive, behavior-aware replay buffer design.
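The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not define how 'on-policyness' is computed, so everything specific here is an assumption made for illustration: the score is taken to be the mean log-likelihood of a trajectory's stored actions under a fixed-variance Gaussian policy, and scores are turned into sampling weights with a softmax. The names `trajectory_on_policyness`, `sampling_weights`, `sample_batch`, `policy_mean`, `std`, and `temperature` are hypothetical and are not taken from the paper.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian with shared std, evaluated at `action`."""
    return sum(
        -0.5 * ((a - m) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for a, m in zip(action, mean)
    )


def trajectory_on_policyness(trajectory, policy_mean, std=0.5):
    """ASSUMED score: mean log-likelihood of stored actions under the current policy."""
    log_probs = [gaussian_log_prob(a, policy_mean(s), std) for s, a in trajectory]
    return sum(log_probs) / len(log_probs)


def sampling_weights(trajectories, policy_mean, temperature=1.0):
    """Return one normalized weight per transition.

    Every transition inherits the weight of the trajectory it belongs to.
    The softmax over trajectory scores is an assumption of this sketch.
    """
    scores = [trajectory_on_policyness(t, policy_mean) for t in trajectories]
    top = max(scores)  # subtract the max for numerical stability
    traj_weights = [math.exp((s - top) / temperature) for s in scores]
    flat = [w for w, t in zip(traj_weights, trajectories) for _ in t]
    total = sum(flat)
    return [w / total for w in flat]


def sample_batch(trajectories, weights, batch_size, rng):
    """Draw transitions with replacement, proportionally to their weights."""
    transitions = [tr for t in trajectories for tr in t]
    return rng.choices(transitions, weights=weights, k=batch_size)


def policy_mean(state):
    """Toy stand-in for the current policy: action mean is half the state."""
    return [0.5 * x for x in state]


# Each transition is a (state, action) pair; rewards and next states are omitted.
offline = [([1.0], [2.0]), ([2.0], [-1.0])]  # actions far from the toy policy
online = [([1.0], [0.5]), ([2.0], [1.1])]    # actions close to the toy policy

weights = sampling_weights([offline, online], policy_mean)
batch = sample_batch([offline, online], weights, batch_size=4, rng=random.Random(0))
```

In this toy setup the online trajectory receives the larger share of the sampling mass because its actions are more likely under the current policy. Since weights are recomputed from the current policy rather than learned, a buffer built this way would shift its emphasis as the policy changes, which is the behavior the abstract describes in general terms.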

Related papers

Human-in-the-loop Online Rejection Sampling for Robotic Manipulation [55.99788088622936]
Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning. Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
arXiv Detail & Related papers (2025-10-30T11:53:08Z)
Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z)
A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning [45.19254609437857]
Online reinforcement learning (RL) excels in complex, safety-critical domains but suffers from sample inefficiency, training instability, and limited interpretability. Data attribution provides a principled way to trace model behavior back to training samples. We propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates.
arXiv Detail & Related papers (2025-05-25T19:25:57Z)
Filtering Learning Histories Enhances In-Context Reinforcement Learning [12.697029805927398]
Transformer models (TMs) have exhibited remarkable in-context reinforcement learning (ICRL) capabilities. We propose a simple yet effective approach, learning history filtering (LHF), to enhance ICRL. LHF is the first approach to avoid source suboptimality by dataset preprocessing.
arXiv Detail & Related papers (2025-05-21T06:00:41Z)
Provably Efficient Online RLHF with One-Pass Reward Modeling [70.82499103200402]
Reinforcement Learning from Human Feedback has shown remarkable success in aligning Large Language Models with human preferences. Online RLHF has emerged as a promising direction, enabling iterative data collection and refinement. We propose a one-pass reward modeling method that eliminates the need to store historical data and achieves constant-time updates per iteration.
arXiv Detail & Related papers (2025-02-11T02:36:01Z)
Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning [66.43003402281659]
A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset. We design a three-stage hybrid RL algorithm that beats the best of both worlds -- pure offline RL and pure online RL. The proposed algorithm does not require any reward information during data collection.
arXiv Detail & Related papers (2023-05-17T15:17:23Z)
Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static and previously collected datasets. We propose a novel approach, which we refer to as adaptive behavior regularization (ABR). ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset.
arXiv Detail & Related papers (2022-11-15T15:59:11Z)
Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets. We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged. We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
A simple but strong baseline for online continual learning: Repeated Augmented Rehearsal [13.075018350152074]
Online continual learning (OCL) aims to train neural networks incrementally from a non-stationary data stream with a single pass through data. Rehearsal-based methods attempt to approximate the observed input distributions over time with a small memory and revisit them later to avoid forgetting. We provide theoretical insights on the inherent memory overfitting risk from the viewpoint of biased and dynamic empirical risk minimization.
arXiv Detail & Related papers (2022-09-28T08:43:35Z)
Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR). We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-06-26T17:50:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.