Efficient Online RL Fine Tuning with Offline Pre-trained Policy Only
- URL: http://arxiv.org/abs/2505.16856v1
- Date: Thu, 22 May 2025 16:14:08 GMT
- Title: Efficient Online RL Fine Tuning with Offline Pre-trained Policy Only
- Authors: Wei Xiao, Jiacheng Liu, Zifeng Zhuang, Runze Suo, Shangke Lyu, Donglin Wang
- Abstract summary: Existing online reinforcement learning (RL) fine-tuning methods require continued training with offline pre-trained Q-functions for stability and performance. We propose a method for efficient online RL fine-tuning using solely the offline pre-trained policy. We introduce PORL (Policy-Only Reinforcement Learning Fine-Tuning), which rapidly initializes the Q-function from scratch during the online phase to avoid detrimental pessimism.
- Score: 22.94253602450729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Improving the performance of pre-trained policies through online reinforcement learning (RL) is a critical yet challenging topic. Existing online RL fine-tuning methods require continued training with offline pre-trained Q-functions for stability and performance. However, these offline pre-trained Q-functions commonly underestimate state-action pairs beyond the offline dataset due to the conservatism in most offline RL methods, which hinders further exploration when transitioning from the offline to the online setting. Additionally, this requirement limits their applicability in scenarios where only pre-trained policies are available but pre-trained Q-functions are absent, such as in imitation learning (IL) pre-training. To address these challenges, we propose a method for efficient online RL fine-tuning using solely the offline pre-trained policy, eliminating reliance on pre-trained Q-functions. We introduce PORL (Policy-Only Reinforcement Learning Fine-Tuning), which rapidly initializes the Q-function from scratch during the online phase to avoid detrimental pessimism. Our method not only achieves performance competitive with advanced offline-to-online RL algorithms and with online RL approaches that leverage prior data or policies, but also pioneers a new path for directly fine-tuning behavior cloning (BC) policies.
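The abstract pins down only the high-level recipe: keep the offline or BC pre-trained actor, discard any pre-trained critic, and learn the Q-function from scratch online. Below is a minimal sketch of that recipe on top of a generic twin-critic off-policy learner; the network sizes, tanh policy head, and update rules are illustrative assumptions, not the paper's exact algorithm.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6  # e.g. a MuJoCo locomotion task

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

actor = mlp(obs_dim, act_dim)  # pre-trained offline or by BC; in practice:
# actor.load_state_dict(torch.load("pretrained_actor.pt"))  # hypothetical checkpoint
critics = [mlp(obs_dim + act_dim, 1) for _ in range(2)]  # fresh, random Q-functions
targets = [copy.deepcopy(c) for c in critics]            # Polyak updates omitted

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam([p for c in critics for p in c.parameters()], lr=3e-4)

def update(obs, act, rew, next_obs, done, gamma=0.99):
    with torch.no_grad():
        next_act = torch.tanh(actor(next_obs))
        tq = torch.min(*[t(torch.cat([next_obs, next_act], -1)) for t in targets])
        y = rew + gamma * (1.0 - done) * tq  # plain TD target: no pessimism penalty
    q_loss = sum(((c(torch.cat([obs, act], -1)) - y) ** 2).mean() for c in critics)
    critic_opt.zero_grad(); q_loss.backward(); critic_opt.step()

    pi_act = torch.tanh(actor(obs))          # fine-tune the pre-trained actor
    pi_loss = -torch.min(*[c(torch.cat([obs, pi_act], -1)) for c in critics]).mean()
    actor_opt.zero_grad(); pi_loss.backward(); actor_opt.step()
```

The key contrast with standard offline-to-online pipelines is that `critics` start from random initialization, so no conservative offline value estimates are carried into the online phase.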
Related papers
- Online Pre-Training for Offline-to-Online Reinforcement Learning [21.146400629843015]
We propose Online Pre-Training for Offline-to-Online RL (OPT) to address the issue of inaccurate value estimation in offline pre-trained agents. OPT introduces a new learning phase, Online Pre-Training, which trains a new value function tailored specifically for effective online fine-tuning (a toy two-phase loop is sketched below).
arXiv Detail & Related papers (2025-07-11T08:00:12Z)
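The summary only names the extra phase, so the loop below is a hedged guess at how it could be structured: the pre-trained policy drives behavior while a fresh value function is fit, then both are fine-tuned jointly. `make_critic`, `td_update`, and `rl_update` are hypothetical placeholders, and a Gymnasium-style environment API is assumed.

```python
def opt_finetune(env, actor, make_critic, td_update, rl_update,
                 pretrain_steps=20_000, total_steps=220_000):
    critic = make_critic()                    # new value function, built online
    buffer = []
    obs, _ = env.reset()
    for t in range(total_steps):
        act = actor.act(obs)                  # behaviour from the pre-trained policy
        next_obs, rew, term, trunc, _ = env.step(act)
        buffer.append((obs, act, rew, next_obs, term))
        obs = env.reset()[0] if (term or trunc) else next_obs
        if t < pretrain_steps:
            td_update(critic, buffer)         # Online Pre-Training: value only
        else:
            rl_update(actor, critic, buffer)  # standard online fine-tuning
```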
- Active Advantage-Aligned Online Reinforcement Learning with Offline Data [56.98480620108727]
We introduce A3RL, which incorporates a novel confidence-aware Active Advantage-Aligned sampling strategy (a simplified weighting rule is sketched below). We demonstrate that our method outperforms competing online RL techniques that leverage offline data.
arXiv Detail & Related papers (2025-02-11T20:31:59Z)
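As a rough illustration of advantage-aligned sampling over mixed offline and online data (the paper's confidence-aware weighting is more elaborate; this softmax rule is an assumption):

```python
import numpy as np

def advantage_aligned_indices(advantages, is_online, batch_size,
                              online_bonus=1.0, temp=1.0):
    """Sample replay indices, preferring high-advantage transitions and
    lightly boosting freshly collected online data."""
    logits = advantages / temp + online_bonus * is_online.astype(np.float64)
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(advantages), size=batch_size, p=probs)
```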
- Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data [64.74333980417235]
We show that retaining offline data is unnecessary as long as we use a properly designed online RL approach for fine-tuning offline RL. We show that Warm-start RL (WSRL) is able to fine-tune without retaining any offline data, learning faster and attaining higher performance than existing algorithms (a warm-start loop is sketched below).
arXiv Detail & Related papers (2024-12-10T18:57:12Z)
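A hedged sketch of what warm-start fine-tuning without retained offline data could look like; the warmup length and update-to-data ratio are illustrative guesses, and a Gymnasium-style API is assumed.

```python
def wsrl_finetune(env, agent, total_steps=250_000, warmup_steps=5_000, utd=4):
    buffer = []                        # fresh, online-only replay buffer
    obs, _ = env.reset()
    for t in range(total_steps):
        act = agent.act(obs)           # agent weights come from offline pre-training
        next_obs, rew, term, trunc, _ = env.step(act)
        buffer.append((obs, act, rew, next_obs, term))
        obs = env.reset()[0] if (term or trunc) else next_obs
        if t >= warmup_steps:          # a short warmup fills the buffer first,
            for _ in range(utd):       # then update aggressively on online data
                agent.update(buffer)
```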
- ENOTO: Improving Offline-to-Online Reinforcement Learning with Q-Ensembles [52.34951901588738]
We propose a novel framework called ENsemble-based Offline-To-Online (ENOTO) RL.
By increasing the number of Q-networks, we seamlessly bridge offline pre-training and online fine-tuning without degrading performance.
Experimental results demonstrate that ENOTO can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods (a toy ensemble-aggregation rule is sketched below).
arXiv Detail & Related papers (2023-06-12T05:10:10Z)
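A toy aggregation rule showing how a larger Q-ensemble can move between pessimistic offline targets and more optimistic online ones; the 0.5 bonus coefficient and the exact aggregation are assumptions, not ENOTO's published loss.

```python
import torch

def ensemble_value(q_values: torch.Tensor, mode: str) -> torch.Tensor:
    """q_values: [n_ensemble, batch] Q-estimates from the ensemble."""
    if mode == "pessimistic":                # offline: guard against overestimation
        return q_values.min(dim=0).values
    mean, std = q_values.mean(dim=0), q_values.std(dim=0)
    return mean + 0.5 * std                  # online: mild optimism aids exploration

q = torch.randn(10, 256)                     # e.g. 10 Q-networks, batch of 256
offline_target = ensemble_value(q, "pessimistic")
online_target = ensemble_value(q, "optimistic")
```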
- Finetuning from Offline Reinforcement Learning: Challenges, Trade-offs and Practical Solutions [30.050083797177706]
Offline reinforcement learning (RL) allows for the training of competent agents from offline datasets without any interaction with the environment.
Online finetuning of such offline models can further improve performance.
We show that it is possible to use standard online off-policy algorithms for faster improvement.
arXiv Detail & Related papers (2023-03-30T14:08:31Z)
- Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning [104.05522247411018]
Offline reinforcement learning (RL) methods tend to behave poorly during fine-tuning.
We show that offline RL algorithms that learn such calibrated value functions lead to effective online fine-tuning.
In practice, Cal-QL can be implemented on top of conservative Q-learning (CQL) for offline RL within a one-line code change (illustrated below).
arXiv Detail & Related papers (2023-03-09T18:31:13Z)
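The calibration can be pictured as a one-line guard inside a CQL-style regularizer. This is a simplification: the real CQL term uses a log-sum-exp over sampled actions, and Cal-QL's reference values are Monte-Carlo returns of the behavior policy.

```python
import torch

def cql_regularizer(q_pi, q_data, ref_value, calibrate=True):
    """q_pi: Q at policy actions; q_data: Q at dataset actions;
    ref_value: value of a reference (behaviour) policy."""
    if calibrate:
        q_pi = torch.maximum(q_pi, ref_value)  # the "one line": never push Q
                                               # below the reference policy's value
    return (q_pi - q_data).mean()              # CQL-style push-down / push-up term
```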
- Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning [80.25648265273155]
Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment.
During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data.
We propose to adaptively weight the behavior cloning loss during online fine-tuning based on the agent's performance and training stability (a toy weighting rule is sketched below).
Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark.
arXiv Detail & Related papers (2022-10-25T09:08:26Z)
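One heuristic way such an adaptive weight could be computed; this controller is an invented illustration of the stated principle (relax the BC constraint as returns improve and training stays stable), not the paper's formula.

```python
import numpy as np

def bc_weight(recent_returns, target_return, w_min=0.0, w_max=1.0):
    progress = np.clip(np.mean(recent_returns) / target_return, 0.0, 1.0)
    stability = 1.0 / (1.0 + np.std(recent_returns))  # noisy returns: keep BC strong
    return float(np.clip(w_max * (1.0 - progress * stability), w_min, w_max))

# actor loss during fine-tuning: rl_loss + bc_weight(...) * bc_loss
```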
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with higher returns (a simplified picking rule is sketched below).
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids learning merely mediocre behaviors on mixed datasets but is even competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
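A simplified picking rule in the spirit of the summary: imitate only trajectories that both beat the current return and stay close to the current policy. The closeness measure and data layout are assumptions.

```python
import numpy as np

def pick_trajectories(trajs, policy, current_return, k=5):
    """trajs: dicts with 'obs' [T, obs_dim], 'acts' [T, act_dim], 'ret' (float)."""
    def closeness(tr):  # negative mean squared action distance to current policy
        return -np.mean(np.sum((policy(tr["obs"]) - tr["acts"]) ** 2, axis=-1))
    better = [tr for tr in trajs if tr["ret"] > current_return]
    return sorted(better, key=closeness, reverse=True)[:k]  # nearest improvers
```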
- Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble [135.6115462399788]
Deep offline reinforcement learning has made it possible to train strong robotic agents from offline datasets.
State-action distribution shift may lead to severe bootstrap error during fine-tuning.
We propose a balanced replay scheme that prioritizes samples encountered online while also encouraging the use of near-on-policy samples (a simplified sampler is sketched below).
arXiv Detail & Related papers (2021-07-01T16:26:54Z)
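A simplified stand-in for balanced-replay sampling: the original trains a network to estimate an online-vs-offline density ratio, whereas here the per-sample ratio is taken as given.

```python
import numpy as np

def balanced_sample(density_ratio, is_online, batch_size, online_boost=2.0):
    """density_ratio: how near-on-policy each stored transition looks."""
    priority = density_ratio * np.where(is_online, online_boost, 1.0)
    probs = priority / priority.sum()
    return np.random.choice(len(priority), size=batch_size, p=probs)
```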