Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation
- URL: http://arxiv.org/abs/2508.06806v1
- Date: Sat, 09 Aug 2025 03:32:23 GMT
- Title: Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation
- Authors: Xiao Huang, Xu Liu, Enze Zhang, Tong Yu, Shuai Li
- Abstract summary: Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work uses offline datasets to generate data that conforms to the online data distribution for data augmentation. We propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG).
- Score: 22.13678670717358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work uses offline datasets to generate data that conforms to the online data distribution for data augmentation. However, generated data still exhibits a gap with the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method to align more of the generated data with the online data, enhancing performance while maintaining the agent's stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By applying CFDG to the popular methods IQL, PEX, and APL, we achieve a notable 15% average improvement in empirical performance on D4RL benchmark domains such as MuJoCo and AntMaze.
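For intuition, classifier-free guidance combines the conditional and unconditional noise predictions of a single diffusion model, avoiding a separately trained classifier. The sketch below is a minimal illustration under assumed names (`cfg_noise_estimate`, the toy denoiser, and the guidance weight are illustrative, not the paper's implementation):

```python
import numpy as np

def cfg_noise_estimate(denoiser, x, t, cond, w):
    """Classifier-free guidance: blend the unconditional and conditional
    noise predictions of one model.
      eps_hat = eps(x, t) + w * (eps(x, t, cond) - eps(x, t))
    """
    eps_uncond = denoiser(x, t, None)
    eps_cond = denoiser(x, t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)

def toy_denoiser(x, t, cond):
    # stand-in for a trained network: conditioning shifts the prediction
    return 0.1 * x if cond is None else 0.1 * x + cond

x = np.ones(3)
cond = np.array([1.0, 0.0, -1.0])
guided = cfg_noise_estimate(toy_denoiser, x, t=0, cond=cond, w=2.0)
# w > 1 amplifies the conditional shift: guided == [2.1, 0.1, -1.9]
```

A guidance weight w > 1 pushes samples harder toward the conditioning signal (here, the target data distribution), typically at some cost in sample diversity.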
Related papers
- A Unified Understanding of Offline Data Selection and Online Self-refining Generation for Post-training LLMs [55.931369468485464]
We tackle offline data selection and online self-refining generation through an optimization perspective. For the first time, we theoretically demonstrate the effectiveness of the bilevel data selection framework.
arXiv Detail & Related papers (2025-11-26T04:48:33Z) - From Static to Dynamic: Enhancing Offline-to-Online Reinforcement Learning via Energy-Guided Diffusion Stratification [3.2883573376133555]
StratDiff uses a diffusion model to learn prior knowledge from the offline dataset. It refines this knowledge through energy-based functions to improve policy imitation and generate offline-like actions during online fine-tuning. Offline-like samples are updated using offline objectives, while online-like samples follow online learning strategies.
arXiv Detail & Related papers (2025-11-05T19:48:46Z) - Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z) - Unsupervised Data Generation for Offline Reinforcement Learning: A Perspective from Model [57.20064815347607]
Offline reinforcement learning (RL) has recently gained growing interest from RL researchers. The performance of offline RL suffers from the out-of-distribution problem, which can be corrected by feedback in online RL. In this paper, we first build a theoretical bridge between the batch data and the performance of offline RL algorithms. We show that in task-agnostic settings, a series of policies trained by unsupervised RL can minimize the worst-case regret in the performance gap.
arXiv Detail & Related papers (2025-06-24T14:08:36Z) - Efficient Reinforcement Learning by Guiding Generalist World Models with Non-Curated Data [32.7248232143849]
Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments.
arXiv Detail & Related papers (2025-02-26T20:34:29Z) - Active Advantage-Aligned Online Reinforcement Learning with Offline Data [56.98480620108727]
We introduce A3RL, which incorporates a novel confidence-aware Active Advantage-Aligned sampling strategy. We demonstrate that our method outperforms competing online RL techniques that leverage offline data.
arXiv Detail & Related papers (2025-02-11T20:31:59Z) - Goal-Conditioned Data Augmentation for Offline Reinforcement Learning [3.5775697416994485]
We introduce Goal-cOnditioned Data Augmentation (GODA), a goal-conditioned diffusion-based method for augmenting samples with higher quality. GODA learns a comprehensive distribution representation of the original offline datasets while generating new data with selectively higher-return goals. We conduct experiments on the D4RL benchmark and real-world challenges, specifically traffic signal control (TSC) tasks, to demonstrate GODA's effectiveness.
arXiv Detail & Related papers (2024-12-29T16:42:30Z) - Energy-Guided Diffusion Sampling for Offline-to-Online Reinforcement Learning [13.802860320234469]
We introduce Energy-guided DIffusion Sampling (EDIS).
EDIS uses a diffusion model to extract prior knowledge from the offline dataset and employs energy functions to distill this knowledge for enhanced data generation in the online phase.
We observe a notable 20% average improvement in empirical performance on MuJoCo, AntMaze, and Adroit environments.
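Energy-guided generation of this kind can be illustrated with importance resampling: draw candidates from a generative model, then keep them with probability proportional to exp(-E). The Gaussian proposal and quadratic energy below are illustrative assumptions, not the EDIS implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    # hypothetical energy: low near the desired "online" mode at 1.0
    return (x - 1.0) ** 2

# candidate samples from a generative model (stand-in: standard Gaussian)
candidates = rng.normal(0.0, 1.0, size=10_000)

# energy-guided resampling: probability proportional to exp(-energy)
weights = np.exp(-energy(candidates))
probs = weights / weights.sum()
resampled = rng.choice(candidates, size=5_000, p=probs)
# the resampled mean shifts from ~0 toward the low-energy region
```

Resampling concentrates the generated data in low-energy (high-compatibility) regions, which is the role the energy function plays in steering generation toward the online distribution.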
arXiv Detail & Related papers (2024-07-17T09:56:51Z) - ATraDiff: Accelerating Online Reinforcement Learning with Imaginary Trajectories [27.5648276335047]
Training autonomous agents with sparse rewards is a long-standing problem in online reinforcement learning (RL). We propose a novel approach that leverages offline data to learn a generative diffusion model, coined Adaptive Trajectory Diffuser (ATraDiff).
ATraDiff consistently achieves state-of-the-art performance across a variety of environments, with particularly pronounced improvements in complicated settings.
arXiv Detail & Related papers (2024-06-06T17:58:15Z) - Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning [66.43003402281659]
A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset.
We design a three-stage hybrid RL algorithm that beats the best of both worlds -- pure offline RL and pure online RL.
The proposed algorithm does not require any reward information during data collection.
arXiv Detail & Related papers (2023-05-17T15:17:23Z) - Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning [80.25648265273155]
Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment.
During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data.
We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability.
Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark.
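One way to realize such adaptive weighting is a simple multiplicative rule that tightens the behavior-cloning constraint when returns dip below the best seen and relaxes it otherwise. The function name and update factors below are hypothetical, not the paper's schedule:

```python
def adaptive_bc_weight(prev_w, recent_return, best_return,
                       grow=1.05, decay=0.99, w_min=0.01, w_max=1.0):
    """Hypothetical rule: regularize harder toward the offline behavior
    policy when performance drops below the best seen, relax otherwise."""
    if recent_return < best_return:
        w = prev_w * grow    # performance dipped: lean on behavior cloning
    else:
        w = prev_w * decay   # stable or improving: trust online updates
    return min(max(w, w_min), w_max)

w = adaptive_bc_weight(0.5, recent_return=80.0, best_return=100.0)
# a dip below the best return grows the weight: 0.5 * 1.05 == 0.525
```

Clamping the weight to [w_min, w_max] prevents the constraint from either vanishing entirely or freezing the agent at its pre-trained behavior.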
arXiv Detail & Related papers (2022-10-25T09:08:26Z) - Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE).
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.