Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning
- URL: http://arxiv.org/abs/2505.17988v3
- Date: Tue, 05 Aug 2025 11:46:13 GMT
- Title: Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning
- Authors: Yutong Chen, Jiandong Gao, Ji Wu,
- Abstract summary: R1-style Reinforcement Learning (RL) significantly enhances Large Language Models' reasoning capabilities. We found that small-scale SFT has substantial influence on RL but shows poor efficiency. We propose Re-distillation, a technique that aims to boost the effectiveness of small-scale distillation.
- Score: 4.967941028121525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: R1-style Reinforcement Learning (RL) significantly enhances Large Language Models' reasoning capabilities, yet the mechanism behind rule-based RL remains unclear. We found that small-scale SFT has substantial influence on RL but shows poor efficiency. To explain our observations, we propose an analytical framework and compare the efficiency of SFT and RL by measuring sample effect. Our hypothetical analysis shows the potential to improve SFT efficiency. Guided by our analysis, we propose Re-distillation, a technique that aims to boost the effectiveness of small-scale distillation by sampling from the RL-trained policy. Re-distillation shows consistently surprising efficiency on three datasets and both Qwen and Llama models: re-distilled models matched RL performance with far fewer samples and less computation. As a result, on the K&K dataset, our re-distilled Qwen-2.5-1.5B model surpasses DeepSeek-V3-0324 with only 1K SFT samples. We demonstrate that re-distillation can be used to efficiently balance multiple goals in RL. Our work explains several interesting phenomena in R1-style RL, shedding light on the mechanisms behind its empirical success. Code is available at: https://github.com/on1262/deep-reasoning.
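The abstract describes Re-distillation only at a high level (sample SFT data from the RL-trained policy, then fine-tune on a small set). Below is a minimal sketch of that loop; the helper names (rl_policy.generate, reward_fn, sft_train) and the keep-only-correct filtering are illustrative assumptions, not the repository's actual API.

```python
# Hedged sketch of the Re-distillation idea, not the authors' implementation.
def redistill(rl_policy, base_model, prompts, reward_fn, sft_train, k_samples=1000):
    """Build a small SFT set by sampling the RL-trained policy, then fine-tune a
    fresh base model on it. Filtering by the rule-based reward is an assumption."""
    sft_set = []
    for prompt in prompts:
        response = rl_policy.generate(prompt)        # sample from the RL-trained policy
        if reward_fn(prompt, response) > 0:          # keep traces the verifier accepts
            sft_set.append({"prompt": prompt, "response": response})
        if len(sft_set) >= k_samples:                # e.g. ~1K samples, as reported on K&K
            break
    return sft_train(base_model, sft_set)            # ordinary supervised fine-tuning
```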
Related papers
- Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [53.239242017802056]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z) - AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy [48.30596996677882]
We investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. Scaling strategies yield notable improvements in reasoning performance. Our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models.
arXiv Detail & Related papers (2025-06-16T09:27:48Z) - RAST: Reasoning Activation in LLMs via Small-model Transfer [33.32587030836428]
Reinforcement learning (RL) has become a powerful approach for improving the reasoning capabilities of large language models (LLMs). However, applying RL at scale remains highly resource-intensive, requiring multiple model copies and extensive GPU workloads. We propose RAST, a simple yet effective method that transfers reasoning behaviors by injecting RL-induced probability adjustments from a small RL-trained model into larger models.
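The summary describes RAST's transfer only in words; the sketch below shows the general logit-delta idea it points to. The additive combination and the alpha weight are assumptions for illustration, not the paper's exact rule.

```python
import torch

def rast_style_logits(large_logits: torch.Tensor,
                      small_rl_logits: torch.Tensor,
                      small_base_logits: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """Shift a large base model's next-token logits by the adjustment RL induced in
    a small model (delta = small_RL - small_base); decode from the result."""
    delta = small_rl_logits - small_base_logits
    return large_logits + alpha * delta
```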
arXiv Detail & Related papers (2025-05-30T17:57:08Z) - Behavior Injection: Preparing Language Models for Reinforcement Learning [24.46625106928253]
Reinforcement fine-tuning (RFT) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RFT: some show substantial performance gains, while others plateau or even degrade. We propose behavior injection, a task-agnostic data-augmentation scheme applied prior to RL.
arXiv Detail & Related papers (2025-05-25T00:54:50Z) - AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z) - Reinforcement Learning for Reasoning in Large Language Models with One Training Example [129.11571295790807]
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). We identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated.
arXiv Detail & Related papers (2025-04-29T09:24:30Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - An Empirical Study on Eliciting and Improving R1-like Reasoning Models [90.52239241349504]
Scaling RL training has become a central technique for implementing R1-like reasoning models. We demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models. We also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models.
arXiv Detail & Related papers (2025-03-06T15:34:27Z) - LIMR: Less is More for RL Scaling [25.477841726836836]
We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523-sample dataset. For reproducible research and future innovation, we are open-sourcing LIMR, including the implementation of LIM, training and evaluation code, curated datasets, and trained models.
arXiv Detail & Related papers (2025-02-17T15:13:29Z) - Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1, which scales reinforcement learning by encouraging exploration and understanding inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z) - Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [83.53178716807776]
This study explores the scaling properties of Reinforcement Learning from Human Feedback (RLHF) in Large Language Models. We analyze key components of the RLHF framework (model size, data composition, and inference budget) and their impacts on performance.
arXiv Detail & Related papers (2024-12-08T17:19:48Z) - Higher Replay Ratio Empowers Sample-Efficient Multi-Agent Reinforcement Learning [47.17030172520195]
Sample efficiency is more challenging in Multi-Agent Reinforcement Learning (MARL) than in single-agent settings because of its inherent partial observability, non-stationary training, and enormous strategy space.
We argue that the widely used episodic training mechanism could be a source of poor sample efficiency.
To better exploit the data already collected, we propose to increase the frequency of the gradient updates per environment interaction.
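In other words, the replay ratio (gradient updates per environment interaction) is raised above one. A generic, hedged sketch of that knob in an off-policy loop follows; the paper's MARL-specific machinery is omitted, and env, agent, and buffer are assumed interfaces.

```python
def train_with_higher_replay_ratio(env, agent, buffer, total_steps, replay_ratio=4):
    """After every environment interaction, run `replay_ratio` gradient updates on
    previously collected data; replay_ratio = 1 recovers the usual schedule."""
    obs = env.reset()
    for _ in range(total_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
        for _ in range(replay_ratio):
            agent.update(buffer.sample())
```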
arXiv Detail & Related papers (2024-04-15T12:18:09Z) - Application of Knowledge Distillation to Multi-task Speech Representation Learning [2.0908300719428228]
Speech representation learning models use a large number of parameters; even the smallest version has 95 million parameters.
In this paper, we investigate the application of knowledge distillation to speech representation learning models followed by fine-tuning.
Our approach results in a nearly 75% reduction in model size while suffering only 0.1% accuracy and 0.9% equal error rate degradation.
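For reference, the generic temperature-scaled distillation objective such a pipeline typically minimizes is sketched below; the paper's multi-task, speech-specific losses are not reproduced here.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard KL-based knowledge-distillation loss (illustrative, not the paper's exact objective)."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_probs = F.log_softmax(student_logits / t, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target; scale by t^2 as usual
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (t * t)
```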
arXiv Detail & Related papers (2022-10-29T14:22:43Z) - Learning to Prune Deep Neural Networks via Reinforcement Learning [64.85939668308966]
PuRL is a deep reinforcement learning-based algorithm for pruning neural networks.
It achieves sparsity and accuracy comparable to current state-of-the-art methods.
arXiv Detail & Related papers (2020-07-09T13:06:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.