Bridging Offline and Online Reinforcement Learning for LLMs
- URL: http://arxiv.org/abs/2506.21495v1
- Date: Thu, 26 Jun 2025 17:25:49 GMT
- Title: Bridging Offline and Online Reinforcement Learning for LLMs
- Authors: Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov
- Abstract summary: We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both.
- Score: 71.48552761763158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
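To make the offline / semi-online / online distinction concrete, below is a minimal Python sketch, not the paper's code, of a DPO-style training loop in which the only knob is how often the generation model is synced to the policy being trained: never syncing keeps the data distribution fixed at the initial model (offline), syncing every step is fully online iterative DPO, and intermediate sync periods give the semi-online regime. The helpers `sample_two_responses`, `judge_prefers`, and `dpo_update` are hypothetical placeholders, and the paper's online GRPO variant is not shown.

```python
import copy
import random


def train_dpo_spectrum(policy, prompts, sample_two_responses, judge_prefers,
                       dpo_update, sync_period=8, num_steps=100, batch_size=16):
    """Sketch of offline / semi-online / online DPO controlled by `sync_period`.

    sync_period=None  -> offline-style: responses always come from the initial model
    sync_period=1     -> fully online (iterative) DPO
    sync_period=k > 1 -> semi-online DPO: rollout model re-synced every k steps
    """
    generator = copy.deepcopy(policy)              # frozen snapshot used for rollouts
    for step in range(num_steps):
        if sync_period and step % sync_period == 0:
            generator = copy.deepcopy(policy)      # sync rollout model to current policy
        batch = []
        for prompt in random.sample(prompts, k=batch_size):
            resp_a, resp_b = sample_two_responses(generator, prompt)
            # Rank the pair with a verifier (verifiable math) or a reward model
            # (non-verifiable instruction following).
            if judge_prefers(prompt, resp_a, resp_b):
                chosen, rejected = resp_a, resp_b
            else:
                chosen, rejected = resp_b, resp_a
            batch.append((prompt, chosen, rejected))
        policy = dpo_update(policy, batch)         # one DPO gradient step on the batch
    return policy
```

A usage call would pass model-specific implementations of the three helpers; per the abstract, the semi-online and fully online variants converge to similar performance and both strongly outperform the offline setting.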
Related papers
- Test-time Offline Reinforcement Learning on Goal-related Experience [50.94457794664909]
Research in foundation models has shown that performance can be substantially improved through test-time training. We propose a novel self-supervised data selection criterion, which selects transitions from an offline dataset according to their relevance to the current state. Our goal-conditioned test-time training (GC-TTT) algorithm applies this routine in a receding-horizon fashion during evaluation, adapting the policy to the current trajectory as it is being rolled out.
arXiv Detail & Related papers (2025-07-24T21:11:39Z) - RAISE: Reinforced Adaptive Instruction Selection For Large Language Models [48.63476198469349]
We propose RAISE (Reinforced Adaptive Instruction SElection), a task-objective-driven instruction selection framework. RAISE incorporates the entire instruction fine-tuning process into optimization, selecting instructions at each step based on each instruction's expected impact on model performance. Experiments and analysis demonstrate the superiority of our method over other instruction selection methods.
arXiv Detail & Related papers (2025-04-09T21:17:52Z) - Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance. We propose Iterative Agent Decoding (IAD), which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - Teaching LLMs to Refine with Tools [68.23479664749271]
Large language models (LLMs) can refine their responses based on feedback, enabling self-improvement through iterative training or test-time refinement. We propose CaP, a novel approach that uses external tools to refine chain-of-thought (CoT) responses generated by the same or other LLMs.
arXiv Detail & Related papers (2024-12-22T05:43:50Z) - A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts. With our synthetic prompts, we use two preference dataset curation methods: rejection sampling (RS) and Monte Carlo Tree Search (MCTS). Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements. High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z) - Scaling Combinatorial Optimization Neural Improvement Heuristics with Online Search and Adaptation [0.40964539027092917]
We introduce Limited Rollout Beam Search (LRBS), a beam search strategy for deep reinforcement learning (DRL) based optimization improvements. We show that LRBS significantly enhances both in-distribution performance and generalization to larger problem instances. We also employ our search strategy for offline and online adaptation of the pre-trained improvement policy, leading to improved search performance.
arXiv Detail & Related papers (2024-12-13T14:25:27Z) - Learning Goal-Conditioned Policies from Sub-Optimal Offline Data via Metric Learning [22.174803826742963]
We address the problem of learning optimal behavior from sub-optimal datasets for goal-conditioned offline reinforcement learning.
We propose the use of metric learning to approximate the optimal value function for goal-conditioned offline RL problems.
We show that our method estimates optimal behaviors from severely sub-optimal offline datasets without suffering from out-of-distribution estimation errors.
arXiv Detail & Related papers (2024-02-16T16:46:53Z) - Context-Former: Stitching via Latent Conditioned Sequence Modeling [31.250234478757665]
We introduce ContextFormer, which integrates contextual information-based imitation learning (IL) and sequence modeling to stitch sub-optimal trajectories.
Experiments show ContextFormer can achieve competitive performance in multiple IL settings.
arXiv Detail & Related papers (2024-01-29T06:05:14Z) - Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Semi-Offline Reinforcement Learning for Optimized Text Generation [35.1606951874979]
In reinforcement learning (RL), there are two major settings for interacting with the environment: online and offline.
Online methods explore the environment at significant time cost, and offline methods efficiently obtain reward signals by sacrificing exploration capability.
We propose semi-offline RL, a novel paradigm that smoothly transitions from offline to online settings, balances exploration capability and training cost, and provides a theoretical foundation for comparing different RL settings.
arXiv Detail & Related papers (2023-06-16T09:24:29Z) - Offline Supervised Learning V.S. Online Direct Policy Optimization: A Comparative Study and A Unified Training Paradigm for Neural Network-Based Optimal Feedback Control [7.242569453287703]
We first conduct a comparative study of two prevalent approaches: offline supervised learning and online direct policy optimization.
Our results underscore the superiority of offline supervised learning in terms of both optimality and training time.
We propose the Pre-train and Fine-tune strategy as a unified training paradigm for optimal feedback control.
arXiv Detail & Related papers (2022-11-29T05:07:13Z) - Offline Preference-Based Apprenticeship Learning [11.21888613165599]
We study how an offline dataset can be used to address two challenges that autonomous systems face when they endeavor to learn from, adapt to, and collaborate with humans.
First, we use the offline dataset to efficiently infer the human's reward function via pool-based active preference learning.
Second, given this learned reward function, we perform offline reinforcement learning to optimize a policy based on the inferred human intent (a minimal illustrative sketch of this two-stage pipeline follows after this list).
arXiv Detail & Related papers (2021-07-20T04:15:52Z)
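The final entry above lends itself to a short illustration. The following is a minimal sketch, under simplifying assumptions (a linear reward model over pre-computed segment features and a `human_prefers` oracle standing in for real human labels), of that two-stage pipeline: pool-based active preference learning with a Bradley-Terry model, followed by relabeling the offline dataset with the learned reward so a standard offline RL algorithm can be trained on it. This is not the paper's implementation.

```python
import numpy as np


def segment_return(w, segment):
    """Linear reward model: sum of w . phi(s, a) over a segment's feature rows."""
    return float(np.sum(segment @ w))


def predict_pref(w, seg_a, seg_b):
    """Bradley-Terry probability that segment A is preferred over segment B."""
    logit = segment_return(w, seg_a) - segment_return(w, seg_b)
    return 1.0 / (1.0 + np.exp(-logit))


def fit_reward(w, labeled_pairs, lr=0.05, epochs=100):
    """Gradient ascent on the Bradley-Terry log-likelihood of labeled pairs.

    Each pair is (preferred_segment, other_segment), with segments given as
    (timesteps x feature_dim) arrays.
    """
    for _ in range(epochs):
        for seg_pref, seg_other in labeled_pairs:
            p = predict_pref(w, seg_pref, seg_other)
            grad = (1.0 - p) * (seg_pref.sum(axis=0) - seg_other.sum(axis=0))
            w = w + lr * grad
    return w


def active_preference_learning(pool, feature_dim, human_prefers, num_queries=20):
    """Pool-based active learning: repeatedly query the most uncertain pair."""
    w = np.zeros(feature_dim)
    labeled = []
    for _ in range(num_queries):
        # Pick the candidate pair whose predicted preference is closest to 0.5.
        idx = min(range(len(pool)),
                  key=lambda i: abs(predict_pref(w, *pool[i]) - 0.5))
        seg_a, seg_b = pool.pop(idx)
        labeled.append((seg_a, seg_b) if human_prefers(seg_a, seg_b)
                       else (seg_b, seg_a))
        w = fit_reward(w, labeled)
    return w


def relabel_with_learned_reward(w, transitions):
    """Swap in the learned reward so an off-the-shelf offline RL method can be run."""
    return [(s, a, float(np.dot(w, phi)), s_next)
            for (s, a, phi, s_next) in transitions]
```

The second stage in the paper is a full offline RL run on the relabeled data; here it is reduced to the relabeling step, since any standard offline RL algorithm can consume the resulting transitions.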