Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
- URL: http://arxiv.org/abs/2506.05316v1
- Date: Thu, 05 Jun 2025 17:55:43 GMT
- Title: Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
- Authors: Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, Huan Zhang
- Abstract summary: Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs). We propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. Our method reduces the RL fine-tuning time needed to reach the same level of performance as the original GRPO algorithm by 25% to 65%.
- Score: 61.823835392216544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism that reuses recent rollouts, lowering per-step computation while maintaining stable updates. Extensive experiments across 6 LLM-dataset combinations show that our method reduces the RL fine-tuning time needed to reach the same level of performance as the original GRPO algorithm by 25% to 65%.
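To make the two techniques concrete, below is a minimal Python sketch of the data-selection side. The paper's attention-based estimator is approximated here with a softmax-weighted similarity average over the reference set; all function names, the embedding inputs, and the temperature are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def reference_difficulty(success: np.ndarray) -> np.ndarray:
    """Adaptive difficulty of the reference questions: the empirical failure
    rate over G rollouts. `success` has shape (n_ref, G), entries in {0, 1}."""
    return 1.0 - success.mean(axis=1)

def estimate_difficulty(q_emb: np.ndarray, ref_emb: np.ndarray,
                        ref_diff: np.ndarray,
                        temperature: float = 0.1) -> np.ndarray:
    """Estimate the difficulty of questions without rollouts from their
    similarity to the reference set (a simplified stand-in for the paper's
    attention-based framework)."""
    # Normalize embeddings so dot products are cosine similarities.
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = q @ r.T                      # (n_q, n_ref) similarities
    w = np.exp(sim / temperature)      # attention-style softmax weights
    w /= w.sum(axis=1, keepdims=True)
    return w @ ref_diff                # weighted average of reference difficulties

def select_moderate(est_diff: np.ndarray, batch_size: int,
                    target: float = 0.5) -> np.ndarray:
    """Pick the questions whose estimated difficulty is closest to `target`;
    moderate difficulty is assumed to yield the most informative GRPO signal."""
    return np.argsort(np.abs(est_diff - target))[:batch_size]
```

The rollout replay mechanism can be sketched as a bounded FIFO buffer that mixes replayed rollouts into each training batch; the capacity, mixing ratio, and class name below are likewise assumptions rather than the paper's exact design.

```python
from collections import deque
import random

class RolloutReplayBuffer:
    """Keeps only recent rollouts (old ones are evicted), so replayed samples
    stay close to the current policy and updates remain stable."""

    def __init__(self, capacity: int = 4096, replay_fraction: float = 0.5):
        self.buffer = deque(maxlen=capacity)
        self.replay_fraction = replay_fraction

    def build_batch(self, fresh_rollouts: list) -> list:
        # Replay a fraction of the batch from the buffer instead of
        # generating it, lowering per-step rollout computation.
        n_replay = int(len(fresh_rollouts) * self.replay_fraction)
        replayed = random.sample(list(self.buffer),
                                 min(n_replay, len(self.buffer)))
        self.buffer.extend(fresh_rollouts)
        return fresh_rollouts + replayed
```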
Related papers
- LearnAlign: Reasoning Data Selection for Reinforcement Learning in Large Language Models Based on Improved Gradient Alignment [14.655048266761783]
Reinforcement learning (RL) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. We present LearnAlign, which selects learnable and representative reasoning data for RL post-training. Experiments across three mathematical reasoning benchmarks demonstrate that our method significantly reduces training data requirements.
arXiv Detail & Related papers (2025-06-13T06:05:58Z)
- TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs [50.820065021136024]
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). Recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings. We propose TACO, a novel reinforcement learning algorithm for visual reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:48Z)
- Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs [12.087316618902433]
Reasoning large language models (LLMs) excel in complex tasks. Existing approaches allocate an equal number of rollouts to all questions during reinforcement learning (RL). We propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems (a toy sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-05-24T07:28:29Z)
- LLM-Independent Adaptive RAG: Let the Question Speak for Itself [47.60917219813637]
Large Language Models (LLMs) are prone to hallucinations, and Retrieval-Augmented Generation (RAG) mitigates this, but at a high computational cost and with a risk of misinformation. In this study, we introduce lightweight LLM-independent adaptive retrieval methods based on external information.
arXiv Detail & Related papers (2025-05-07T08:58:52Z)
- Efficient Reinforcement Learning by Guiding Generalist World Models with Non-Curated Data [32.7248232143849]
Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments.
arXiv Detail & Related papers (2025-02-26T20:34:29Z)
- Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset [29.573555134322543]
Offline reinforcement learning (RL) allows agents to learn from pre-collected datasets without further interaction with the environment. A key, yet underexplored, challenge in offline RL is selecting an optimal subset of the offline dataset. We introduce ReDOR, a method that frames dataset selection as a gradient approximation optimization problem.
arXiv Detail & Related papers (2025-02-26T09:08:47Z)
- Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices. One promising solution is uncertainty-based SLM routing, which offloads high-stakes queries to stronger LLMs when the SLM produces low-confidence responses. We conduct a comprehensive investigation into the benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z)
- Adaptive Data Exploitation in Deep Reinforcement Learning [50.53705050673944]
We introduce ADEPT, a powerful framework to enhance the **data efficiency** and **generalization** in deep reinforcement learning (RL). Specifically, ADEPT adaptively manages the use of sampled data across different learning stages via multi-armed bandit (MAB) algorithms. We test ADEPT on benchmarks including Procgen, MiniGrid, and PyBullet.
arXiv Detail & Related papers (2025-01-22T04:01:17Z) - Federated Learning of Large Language Models with Parameter-Efficient
Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally requires updating a significant number of parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
- Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches.
arXiv Detail & Related papers (2023-02-06T17:30:22Z)
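One of the related papers above ("Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs") proposes dynamically allocating rollout budgets by problem difficulty. Below is a toy Python sketch of that general idea; the variance-based weighting rule and all names are illustrative assumptions, not taken from the cited paper.

```python
import numpy as np

def allocate_rollouts(est_difficulty: np.ndarray, total_budget: int,
                      min_rollouts: int = 2) -> np.ndarray:
    """Split a fixed rollout budget across questions by estimated difficulty."""
    p = 1.0 - np.asarray(est_difficulty)   # estimated per-question success rate
    w = np.maximum(p * (1.0 - p), 1e-6)    # Bernoulli variance: largest near p = 0.5
    spare = total_budget - min_rollouts * len(p)
    extra = np.floor(spare * w / w.sum()).astype(int)
    # Every question gets a floor of `min_rollouts`; the remaining budget goes
    # where outcomes are most uncertain. Rounding may leave a few rollouts unused.
    return min_rollouts + extra
```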