On Predictability of Reinforcement Learning Dynamics for Large Language Models
- URL: http://arxiv.org/abs/2510.00553v2
- Date: Thu, 02 Oct 2025 15:16:51 GMT
- Title: On Predictability of Reinforcement Learning Dynamics for Large Language Models
- Authors: Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, Junfeng Fang
- Abstract summary: This work identifies two fundamental properties of RL-induced parameter updates in large language models. We propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window.
- Score: 20.320268628019047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 8 LLMs and 7 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window, achieving up to 2.5× speedup while retaining >96% of reasoning performance without extra modules or hyperparameter tuning. This positions our findings as a versatile and practical tool for large-scale RL, opening a path toward principled, interpretable, and efficient training paradigms for LLMs.
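The two properties translate directly into a small amount of linear algebra. Below is a minimal sketch, assuming access to a base weight matrix and a few early RL checkpoints; it illustrates the stated properties rather than the authors' AlphaRL implementation, and every function and variable name is hypothetical. It truncates the update matrix to its top singular component (Rank-1 Dominance) and linearly extrapolates that component from an early window to a later step (Rank-1 Linear Dynamics).

```python
# Minimal sketch (not the authors' code) of the paper's two properties,
# illustrated on a single weight matrix with NumPy.
import numpy as np

def rank1_update(w_base: np.ndarray, w_rl: np.ndarray) -> np.ndarray:
    """Keep only the top singular component of the update
    Delta = W_rl - W_base (Rank-1 Dominance)."""
    u, s, vt = np.linalg.svd(w_rl - w_base, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0, :])

def extrapolate_rank1(w_base, checkpoints, steps, target_step):
    """Fit a per-entry linear trend to the rank-1 updates seen in a short
    early window, extrapolate it to target_step, and add it back to the
    base weights (Rank-1 Linear Dynamics)."""
    updates = np.stack([rank1_update(w_base, w) for w in checkpoints])
    t = np.asarray(steps, dtype=float)
    t_c = t - t.mean()                      # centered step indices
    slope = (t_c[:, None, None] * (updates - updates.mean(axis=0))).sum(0) \
            / (t_c ** 2).sum()              # least-squares slope per entry
    predicted = updates.mean(axis=0) + slope * (target_step - t.mean())
    return w_base + predicted

# Toy usage: three early "checkpoints" whose dominant update direction grows
# linearly with the step count; predict the weights at step 1000.
rng = np.random.default_rng(0)
w0 = rng.standard_normal((64, 64))
direction = np.outer(rng.standard_normal(64), rng.standard_normal(64))
ckpts = [w0 + (s / 1000) * direction for s in (100, 200, 300)]
w_pred = extrapolate_rank1(w0, ckpts, [100, 200, 300], target_step=1000)
```

Fitting each entry of the rank-1 update linearly is a simplification; the abstract speaks of the dominant singular subspace itself evolving linearly, but the extrapolate-from-an-early-window idea is the one AlphaRL exploits.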
Related papers
- Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs [21.242959630751663]
We show that reinforcement learning can be substantially more parameter-efficient than previously recognized. Experiments demonstrate that the substantially more memory-efficient SGD matches or even outperforms AdamW in RL for LLMs.
arXiv Detail & Related papers (2026-02-07T23:25:26Z)
- Learning Dynamics in RL Post-Training for Language Models [2.538209532048867]
We analyze the learning dynamics of RL post-training from a perspective that has been studied in supervised learning but remains underexplored in RL. We show that limited variability in feature representations can cause RL updates to systematically increase model confidence. Motivated by these insights, we propose classifier-first reinforcement learning (CF-RL), a simple two-stage training strategy.
arXiv Detail & Related papers (2026-01-08T07:32:15Z)
- The Path Not Taken: RLVR Provably Learns Off the Principals [85.41043469428365]
We show that sparsity is a surface artifact of a model-conditioned optimization bias. We mechanistically explain these dynamics with a Three-Gate Theory. We provide a parameter-level characterization of RLVR's learning dynamics.
arXiv Detail & Related papers (2025-11-11T18:49:45Z)
- Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch [63.40752011615843]
Training tool-augmented language models has emerged as a promising approach to enhancing their capabilities for complex tasks. We propose a dynamic generalization-guided reward design for rule-based reinforcement learning. We show that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models.
arXiv Detail & Related papers (2025-11-02T16:33:45Z)
- Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning [16.095629872564874]
Reinforcement learning is arguably the most prominent fine-tuning method. Evolution strategies (ES) once showed comparable performance to RL on models with a few million parameters. ES can search efficiently over billions of parameters and outperform existing RL fine-tuning methods in multiple respects. (A minimal ES update sketch follows this entry.)
arXiv Detail & Related papers (2025-09-29T07:19:34Z)
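Since evolution strategies may be less familiar than policy-gradient RL, here is a minimal, generic sketch of one ES update with antithetic sampling; it is textbook ES, not this paper's billion-parameter method, and `reward_fn`, `pop_size`, `sigma`, and `lr` are illustrative assumptions.

```python
# Minimal sketch of a basic evolution-strategies (ES) step: estimate a
# search gradient from rewards of randomly perturbed parameter vectors.
# Generic illustration, not the referenced paper's algorithm.
import numpy as np

def es_step(theta, reward_fn, pop_size=32, sigma=0.02, lr=0.01, rng=None):
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((pop_size, theta.size))
    # Antithetic pairs reduce variance: evaluate +eps and -eps.
    rewards = np.array([reward_fn(theta + sigma * e) - reward_fn(theta - sigma * e)
                        for e in eps])
    grad_est = (eps * rewards[:, None]).mean(axis=0) / (2 * sigma)
    return theta + lr * grad_est

# Toy usage: maximize -||theta - 1||^2, whose optimum is the all-ones vector.
theta = np.zeros(8)
for _ in range(200):
    theta = es_step(theta, lambda p: -np.sum((p - 1.0) ** 2))
```

Because only reward evaluations are needed, the update is gradient-free, which is what makes ES attractive as an alternative to backpropagation-based RL fine-tuning.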
- Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs [13.036236161537147]
Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the underlying mechanisms by which RL fine-tuning enhances the capability of various LLMs with distinct intrinsic characteristics remain underexplored.
arXiv Detail & Related papers (2025-09-25T11:51:05Z)
- Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing large language models (LLMs). RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT.
arXiv Detail & Related papers (2025-09-23T17:10:40Z)
- Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle [53.239242017802056]
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing and Rollout Silencing. We propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition.
arXiv Detail & Related papers (2025-08-07T17:53:47Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- Reinforcement Learning Finetunes Small Subnetworks in Large Language Models [27.55599230411277]
Reinforcement learning (RL) yields substantial improvements in large language models' downstream task performance and alignment with human values. Surprisingly, such large gains result from updating only a small subnetwork comprising just 5 to 30 percent of the parameters. We refer to this phenomenon as parameter update sparsity induced by RL. (A short sparsity-measurement sketch follows this entry.)
arXiv Detail & Related papers (2025-05-16T21:42:28Z)
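The update-sparsity observation is easy to probe on any pair of checkpoints. A minimal sketch, assuming two PyTorch models with identical architectures; the names and tolerance are illustrative, not the paper's measurement code.

```python
# Minimal sketch: measure RL-induced parameter-update sparsity by counting
# the fraction of weights that changed beyond a small tolerance.
# Hypothetical illustration, not the referenced paper's code.
import torch

@torch.no_grad()
def update_sparsity(model_before, model_after, tol=0.0):
    changed, total = 0, 0
    for (_, p0), (_, p1) in zip(model_before.named_parameters(),
                                model_after.named_parameters()):
        diff = (p1 - p0).abs()
        changed += (diff > tol).sum().item()
        total += p0.numel()
    # The cited paper reports that only ~5-30% of parameters move under RL.
    return changed / total
```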
- Model Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to expedite alignment training with human preferences. We demonstrate that ExPO boosts a DPO model trained with only 20% of the training steps to outperform the fully-trained one. We show that ExPO notably improves existing open-source LLMs on the leading AlpacaEval 2.0 and MT-Bench benchmarks. (A short weight-extrapolation sketch follows this entry.)
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
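The extrapolation idea behind ExPO, which AlphaRL echoes in the RL setting, fits in a few lines of weight arithmetic. A minimal sketch, assuming state dicts from an SFT model and a partially aligned (e.g. DPO) model with floating-point parameters; `alpha` and all names are illustrative assumptions, not ExPO's released code.

```python
# Minimal sketch of weight extrapolation in the spirit of ExPO: amplify the
# update that an alignment run (e.g. DPO) applied on top of the SFT weights.
import torch

@torch.no_grad()
def extrapolate(sft_state, aligned_state, alpha=0.5):
    """Return weights extrapolated past the aligned model:
    w_expo = w_aligned + alpha * (w_aligned - w_sft)."""
    return {k: aligned_state[k] + alpha * (aligned_state[k] - sft_state[k])
            for k in aligned_state}

# Usage: expo_state = extrapolate(sft_model.state_dict(),
#                                 dpo_model.state_dict(), alpha=0.3)
#        model.load_state_dict(expo_state)
```

The design choice is first-order: treat the aligned-minus-SFT weight difference as a direction worth amplifying, so a partially trained model can stand in for a fully trained one.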