Not All Steps are Informative: On the Linearity of LLMs' RLVR Training
- URL: http://arxiv.org/abs/2601.04537v1
- Date: Thu, 08 Jan 2026 03:06:18 GMT
- Title: Not All Steps are Informative: On the Linearity of LLMs' RLVR Training
- Authors: Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, Ning Miao,
- Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. We investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation.
- Score: 14.59942263367421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. Unlike supervised fine-tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log-probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on all four benchmarks by extrapolating beyond the step range where RL training remains stable.
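Since the abstract describes both Weight Extrapolation and Logits Extrapolation as linear extrapolations over RL training steps, a small sketch helps make the mechanics concrete. The code below is an illustration of checkpoint-level linear extrapolation under the paper's linearity observation, not the authors' released implementation: the function names, the file names, the choice of exactly two anchor checkpoints, and the log-softmax renormalization of extrapolated log-probabilities are all assumptions.

```python
import torch

def weight_extrapolation(ckpt_a, ckpt_b, step_a, step_b, target_step):
    """Linearly extrapolate a model's state dict to a future RL step.

    Assumes weights evolve approximately linearly in the training step,
    as the paper observes. ckpt_a/ckpt_b are state dicts saved at steps
    step_a < step_b; alpha > 1 means genuine extrapolation.
    """
    assert step_b > step_a, "anchor checkpoints must come from distinct steps"
    alpha = (target_step - step_a) / (step_b - step_a)
    out = {}
    for name, w_a in ckpt_a.items():
        w_b = ckpt_b[name]
        if torch.is_floating_point(w_a):
            # W(t) ~ W(t_a) + alpha * (W(t_b) - W(t_a))
            out[name] = w_a + alpha * (w_b - w_a)
        else:
            # non-float buffers (e.g. integer step counters) are copied as-is
            out[name] = w_b.clone()
    return out

def logits_extrapolation(logp_a, logp_b, step_a, step_b, target_step):
    """Extrapolate per-token log-probabilities along the same linear trend.

    Renormalizing with log_softmax keeps the result a valid distribution;
    whether the paper renormalizes this way is an assumption here.
    """
    alpha = (target_step - step_a) / (step_b - step_a)
    return torch.log_softmax(logp_a + alpha * (logp_b - logp_a), dim=-1)

# Hypothetical usage: predict the step-3000 model from steps 1000 and 2000.
ckpt_1k = torch.load("ckpt_step1000.pt")  # illustrative file names
ckpt_2k = torch.load("ckpt_step2000.pt")
ckpt_3k = weight_extrapolation(ckpt_1k, ckpt_2k, 1000, 2000, 3000)
torch.save(ckpt_3k, "ckpt_step3000_extrapolated.pt")
```

One design note: extrapolating in weight space yields a model that can be served normally, whereas extrapolating in logit space requires keeping two checkpoints at inference time but, per the abstract, can extend beyond the step range where RL training itself remains stable.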
Related papers
- Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories [33.872433985210876]
Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. We propose Discover, Learn and Reinforce, which generates multiple distinct, high-success behavioral patterns for VLA pretraining. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets.
arXiv Detail & Related papers (2025-11-24T07:54:49Z) - Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning [49.22815446849924]
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled. We propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions".
arXiv Detail & Related papers (2025-10-29T22:05:08Z) - RL in the Wild: Characterizing RLVR Training in LLM Deployment [43.81962834561768]
Reinforcement Learning with Verifiable Rewards (RLVR) has surged in recent months as a way to enhance LLMs' reasoning and understanding abilities. However, its complex data flows and diverse tasks pose substantial challenges to RL training systems. There is limited understanding of RLVR from a system perspective.
arXiv Detail & Related papers (2025-09-29T03:09:27Z) - Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing large language models (LLMs). RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT.
arXiv Detail & Related papers (2025-09-23T17:10:40Z) - Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle [66.80133103857703]
Reinforcement Learning (RL) has markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs). This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs.
arXiv Detail & Related papers (2025-09-20T13:11:28Z) - SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning [81.7764584515496]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. These models face two fundamental challenges: the scarcity and the high cost of large-scale human-operated robotic trajectories. We introduce SimpleVLA-RL, an efficient reinforcement learning framework tailored for VLA models.
arXiv Detail & Related papers (2025-09-11T17:59:17Z) - History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL [14.506189610798929]
Reinforcement learning (RL) has emerged as a pivotal methodology for enhancing the reasoning capabilities of large language models (LLMs). We introduce RhymeRL, an LLM RL system designed to accelerate RL training with two key innovations. First, to enhance rollout generation, we present HistoSpec, a speculative decoding inference engine. Second, to tackle rollout bubbles, we introduce HistoPipe, a two-tier scheduling strategy.
arXiv Detail & Related papers (2025-08-26T01:42:46Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [66.61292196146016]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). This study critically examines the current state of RLVR. We find that the current training setup does not elicit fundamentally new reasoning patterns.
arXiv Detail & Related papers (2025-04-18T17:59:56Z)