Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study
- URL: http://arxiv.org/abs/2505.02142v1
- Date: Sun, 04 May 2025 15:09:49 GMT
- Title: Exploring the Potential of Offline RL for Reasoning in LLMs: A Preliminary Study
- Authors: Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, Xiangang Li
- Abstract summary: Long-context reasoning by large language models (LLMs) incurs substantial computational costs and complexity. We investigate the effectiveness of Offline RL methods, specifically Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO. Experiments demonstrate that these simpler Offline RL methods substantially improve model performance, achieving an average enhancement of 3.3%.
- Score: 16.441081996257576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant advances in long-context reasoning by large language models (LLMs), primarily through Online Reinforcement Learning (RL) methods, these approaches incur substantial computational costs and complexity. In contrast, simpler and more economical Offline RL methods remain underexplored. To address this gap, we investigate the effectiveness of Offline RL methods, specifically Direct Preference Optimization (DPO) and its length-desensitized variant LD-DPO, in enhancing the reasoning capabilities of LLMs. Extensive experiments across multiple reasoning benchmarks demonstrate that these simpler Offline RL methods substantially improve model performance, achieving an average enhancement of 3.3%, with a particularly notable increase of 10.1% on the challenging Arena-Hard benchmark. Furthermore, we analyze DPO's sensitivity to output length, emphasizing that increasing reasoning length should align with semantic richness, as indiscriminate lengthening may adversely affect model performance. We provide comprehensive descriptions of our data processing and training methodologies, offering empirical evidence and practical insights for developing more cost-effective Offline RL approaches.
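For a concrete reference point, the snippet below is a minimal sketch of the standard DPO objective that serves as the Offline RL training signal discussed above. It is a generic illustration under assumed tensor shapes and an assumed `beta` value, not the authors' implementation; the length-desensitized LD-DPO variant modifies this objective to reduce its sensitivity to response length and is not shown.

```python
# Minimal sketch of the standard DPO loss (generic illustration, not the
# authors' code). Inputs are summed per-sequence log-probabilities of the
# preferred ("chosen") and dispreferred ("rejected") responses under the
# trainable policy and a frozen reference model; all tensors have shape [batch].
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are log-ratios between the policy and the reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry preference loss on the scaled reward margin.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```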
Related papers
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). We present a theoretical characterization of LLM reasoning grounded in the information bottleneck (IB) principle. We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
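For orientation only, the generic information bottleneck objective underlying this line of work trades off informativeness about the target against compression of the input; the exact IBRO objective may differ and is not reproduced here.

```latex
% Generic information bottleneck objective (orientation only; the exact IBRO
% formulation may differ). T is the learned representation (here, a reasoning
% trajectory), X the input prompt, Y the target answer, and I(.;.) denotes
% mutual information.
\max_{p(t \mid x)} \; I(T; Y) - \beta \, I(T; X)
```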
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
- Unsupervised Data Generation for Offline Reinforcement Learning: A Perspective from Model [57.20064815347607]
Offline reinforcement learning (RL) has recently gained growing interest from RL researchers. The performance of offline RL suffers from the out-of-distribution problem, which can be corrected by feedback in online RL. In this paper, we first theoretically build a bridge between the batch data and the performance of offline RL algorithms. We show that in task-agnostic settings, a series of policies trained by unsupervised RL can minimize the worst-case regret in the performance gap.
arXiv Detail & Related papers (2025-06-24T14:08:36Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [82.43575191712726]
We introduce a fine-grained analytic framework to dissect the impact of reinforcement learning on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation [29.579349371114702]
Direct Preference Optimization (DPO) is a cost-effective alternative to reinforcement learning (RL) for large language models (LLMs). We show that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance. With simple verifiable rewards, our model achieves RL-level performance with significantly lower computational overhead.
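The snippet below is a hedged sketch of how preference pairs with a simple verifiable reward (exact-match answer checking) might be assembled for such DPO training; `generate` and `extract_answer` are hypothetical helpers, not functions from the cited paper.

```python
# Hypothetical sketch of building DPO preference pairs with a verifiable
# reward (exact-match answer checking). `generate` and `extract_answer` are
# assumed helpers, not part of the cited paper's code.
from typing import Callable, Dict, List

def build_preference_pairs(prompts: List[str],
                           gold_answers: List[str],
                           generate: Callable[[str], str],
                           extract_answer: Callable[[str], str],
                           samples_per_prompt: int = 8) -> List[Dict[str, str]]:
    pairs = []
    for prompt, gold in zip(prompts, gold_answers):
        samples = [generate(prompt) for _ in range(samples_per_prompt)]
        correct = [s for s in samples if extract_answer(s) == gold]
        incorrect = [s for s in samples if extract_answer(s) != gold]
        if correct and incorrect:
            # Coarse filtering: keep one verified-correct and one incorrect
            # sample per prompt as the (chosen, rejected) pair.
            pairs.append({"prompt": prompt,
                          "chosen": correct[0],
                          "rejected": incorrect[0]})
    return pairs
```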
arXiv Detail & Related papers (2025-03-17T06:28:25Z)
- Yes, Q-learning Helps Offline In-Context RL [69.26691452160505]
We show that optimizing RL objectives improves performance by approximately 40% on average compared to the widely established Algorithm Distillation (AD) baseline. Our results also reveal that offline RL-based methods outperform online approaches, which are not specifically designed for offline scenarios.
arXiv Detail & Related papers (2025-02-24T21:29:06Z)
- Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization [22.67700436936984]
We introduce Direct Advantage Policy Optimization (DAPO), a novel step-level offline reinforcement learning algorithm. DAPO employs a critic function to predict the reasoning accuracy at each step, thereby generating dense signals to refine the generation strategy. Our results show that DAPO can effectively enhance the mathematical and code capabilities of both SFT models and RL models.
arXiv Detail & Related papers (2024-12-24T08:39:35Z)
- Diffusion-Based Offline RL for Improved Decision-Making in Augmented ARC Task [10.046325073900297]
We introduce an augmented offline RL dataset for Abstraction and Reasoning (SOLAR).
SOLAR enables the application of offline RL methods by offering sufficient experience data.
Our experiments demonstrate the effectiveness of the offline RL approach on a simple ARC task.
arXiv Detail & Related papers (2024-10-15T06:48:27Z)
- Enhancing Sample Efficiency and Exploration in Reinforcement Learning through the Integration of Diffusion Models and Proximal Policy Optimization [1.631115063641726]
We propose a framework that enhances PPO algorithms by incorporating a diffusion model to generate high-quality virtual trajectories for offline datasets. Our contributions are threefold: we explore the potential of diffusion models in RL, particularly for offline datasets, extend the application of online RL to offline environments, and experimentally validate the performance improvements of PPO with diffusion models.
arXiv Detail & Related papers (2024-09-02T19:10:32Z)
- Pessimistic Model Selection for Offline Deep Reinforcement Learning [56.282483586473816]
Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision-making problems in many applications.
One main barrier is the over-fitting issue that leads to poor generalizability of the policy learned by DRL.
We propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee.
arXiv Detail & Related papers (2021-11-29T06:29:49Z)
- Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE).
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z)
- Instabilities of Offline RL with Pre-Trained Neural Representation [127.89397629569808]
In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated.
Recent theoretical advances have shown that such sample-efficient offline RL is indeed possible provided certain strong representational conditions hold.
This work studies these issues from an empirical perspective to gauge how stable offline RL methods are.
arXiv Detail & Related papers (2021-03-08T18:06:44Z)
- MOReL: Model-Based Offline Reinforcement Learning [49.30091375141527]
In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment.
We present MOReL, an algorithmic framework for model-based offline RL.
We show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks.
arXiv Detail & Related papers (2020-05-12T17:52:43Z)
- Comprehensive Review of Deep Reinforcement Learning Methods and Applications in Economics [9.080472817672264]
DRL is characterized by scalability, with the potential to be applied to high-dimensional problems and to noisy, nonlinear patterns of economic data.
The architecture of DRL as applied to economic problems is investigated to highlight complexity, robustness, accuracy, performance, computational tasks, risk constraints, and profitability.
arXiv Detail & Related papers (2020-03-21T14:07:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.