The Path Not Taken: RLVR Provably Learns Off the Principals
- URL: http://arxiv.org/abs/2511.08567v1
- Date: Wed, 12 Nov 2025 02:05:16 GMT
- Title: The Path Not Taken: RLVR Provably Learns Off the Principals
- Authors: Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai
- Abstract summary: We show that sparsity is a surface artifact of a model-conditioned optimization bias. We mechanistically explain these dynamics with a Three-Gate Theory. We provide a parameter-level characterization of RLVR's learning dynamics.
- Score: 85.41043469428365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR's learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags RLVR. Together, these results provide the first parameter-space account of RLVR's training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
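The Three-Gate claims lend themselves to a numerical probe. Below is a minimal PyTorch sketch, using stand-in random matrices rather than real checkpoints, of two quantities the abstract names: (Gate II) how much of an update lies off a weight matrix's principal directions, and (Gate III) how bf16 storage can hide dense micro-updates as apparent sparsity. The shapes and the subspace size k are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' code) of off-principal energy (Gate II)
# and precision-induced apparent sparsity (Gate III).
import torch

torch.manual_seed(0)
W = torch.randn(256, 256)           # stand-in for a pretrained weight matrix
dW = 1e-3 * torch.randn(256, 256)   # stand-in for an RLVR update (dense!)

# Gate II: split the update into components inside and outside the top-k
# principal (left-singular) subspace of W and compare their energies.
U, S, Vh = torch.linalg.svd(W)
k = 32                              # illustrative principal-subspace size
P = U[:, :k] @ U[:, :k].T           # projector onto the principal subspace
on_energy = torch.linalg.norm(P @ dW)
off_energy = torch.linalg.norm(dW - P @ dW)
print(f"principal / off-principal energy: {on_energy:.4f} / {off_energy:.4f}")

# Gate III: updates below bf16's resolution at a weight's magnitude vanish
# when the checkpoint is stored in bf16, so the surviving updates form an
# apparently sparse mask even though dW itself is dense.
visible = (W.to(torch.bfloat16) != (W + dW).to(torch.bfloat16))
print(f"fraction visibly changed in bf16: {visible.float().mean():.3f}")
```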
Related papers
- GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR [10.820638016337869]
We propose GeoRA, which exploits the anisotropic and compressible nature of RL update subspaces. GeoRA mitigates optimization bottlenecks caused by geometric misalignment. It consistently outperforms established low-rank baselines on key mathematical benchmarks.
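The summary gives no construction details, but the general idea it gestures at can be sketched: estimate the dominant subspace of an observed (anisotropic, compressible) RL update and seed a rank-r adapter inside it. The synthetic update, the rank r, and the factor initialization below are my assumptions for illustration, not GeoRA's method.

```python
# Hedged sketch: align a rank-r adapter with the dominant subspace of a
# synthetic, strongly anisotropic update. Not GeoRA's actual construction.
import torch

torch.manual_seed(0)
# Synthetic update whose singular values decay fast (i.e., compressible).
dW = torch.randn(256, 64) @ torch.diag(torch.logspace(0, -3, 64)) @ torch.randn(64, 256)
U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
r = 8
A = U[:, :r] * S[:r]   # (256, r) factor seeded in the dominant subspace
B = Vh[:r, :]          # (r, 256) factor; A @ B is the best rank-r approximation
print(f"rank-{r} energy share: {S[:r].square().sum() / S.square().sum():.3f}")
```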
arXiv Detail & Related papers (2026-01-14T10:41:34Z) - High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning [57.85676271833619]
Low-Rank Adaptation (LoRA) uses a low-rank update to approximate full-parameter fine-tuning. We present SMoA, a high-rank Structured MOdulation Adapter that uses fewer trainable parameters while maintaining a higher rank.
arXiv Detail & Related papers (2026-01-12T13:06:17Z) - Evaluating Parameter Efficient Methods for RLVR [38.45552186628944]
Reinforcement Learning with Verifiable Rewards (RLVR) incentivizes language models to enhance their reasoning capabilities through verifiable feedback. While methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. We conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-12-29T03:13:08Z) - Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning [93.19037653970622]
We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
arXiv Detail & Related papers (2025-10-31T16:30:08Z) - On Predictability of Reinforcement Learning Dynamics for Large Language Models [20.320268628019047]
This work identifies two fundamental properties of RL-induced parameter updates in large language models. We propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window.
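The mechanism this implies can be sketched simply: if the update direction stabilizes early, the final update is roughly a rescaling of an early-window update. The linear rule and the step-count ratio below are illustrative assumptions on my part, not AlphaRL's actual extrapolation scheme.

```python
# Hedged sketch of extrapolating a final update from a short early window.
import torch

def extrapolate_update(w_init, w_early, scale):
    """Rescale the early update direction to estimate the final weights.

    `scale` is a hypothetical factor, e.g. total_steps / early_steps."""
    return w_init + scale * (w_early - w_init)

w0 = torch.randn(8, 8)
w100 = w0 + 0.01 * torch.randn(8, 8)                # weights after 100 early steps
w_final_est = extrapolate_update(w0, w100, scale=1000 / 100)
```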
arXiv Detail & Related papers (2025-10-01T06:13:50Z) - The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features [1.7832672957068079]
We introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. We show that this mechanism is principled and expressive enough to approximate the behavioral shifts of post-training processes. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
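As a rough picture of "modulating interpretable sparse features", the sketch below encodes residual activations with a stand-in sparse autoencoder, rescales the feature activations with a small trainable gain vector (the lightweight adapter), and decodes the steering signal back. The SAE weights and the gain parameterization are assumptions, not FSRL's design.

```python
# Hedged sketch of feature steering via a lightweight gain adapter.
import torch

d_model, d_feat = 128, 1024
W_enc = torch.randn(d_model, d_feat) / d_model ** 0.5   # stand-in SAE encoder
W_dec = torch.randn(d_feat, d_model) / d_feat ** 0.5    # stand-in SAE decoder
gain = torch.nn.Parameter(torch.zeros(d_feat))          # trainable steering gains

def steer(h):
    f = torch.relu(h @ W_enc)        # sparse, nominally interpretable features
    return h + (f * gain) @ W_dec    # additive steering in the residual stream
```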
arXiv Detail & Related papers (2025-09-16T10:32:40Z) - The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward [57.56453588632619]
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance. This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. We argue that standard RLVR objectives lack a crucial mechanism for knowledge retention.
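The title frames the fix around which direction of KL the objective penalizes. A generic formulation of the two candidate regularizers, not the paper's specific objective, is sketched below: reverse KL (policy || reference) is mode-seeking and can collapse diversity, while forward KL (reference || policy) penalizes abandoning modes the reference still covers.

```python
# Generic sketch of the two KL regularizer directions for an RLVR objective.
import torch
import torch.nn.functional as F

def kl_penalties(policy_logits, ref_logits):
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(ref_logits, dim=-1)
    p, q = logp.exp(), logq.exp()
    reverse_kl = (p * (logp - logq)).sum(-1)   # KL(policy || reference): mode-seeking
    forward_kl = (q * (logq - logp)).sum(-1)   # KL(reference || policy): mass-covering
    return reverse_kl, forward_kl
```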
arXiv Detail & Related papers (2025-09-09T06:34:32Z) - Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models [0.0]
It is often assumed that reinforcement learning (RL) fine-tuning requires updating most of a model's parameters; in practice, however, RL moves only a small, sparse subnetwork. We call this phenomenon RL-induced parameter update sparsity. We show that fine-tuning only this sparse subnetwork recovers full model performance and yields parameters nearly identical to the fully fine-tuned model.
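The claim is directly checkable: diff two checkpoints, keep the mask of touched parameters, and confine further training to it. A minimal sketch, with a synthetic 5%-touched weight matrix standing in for a real RL checkpoint:

```python
# Sketch of measuring update sparsity and confining training to the mask.
import torch

def update_mask(w_before, w_after, atol=0.0):
    """Boolean mask of parameters the fine-tune actually moved."""
    return (w_after - w_before).abs() > atol

torch.manual_seed(0)
w_pre = torch.randn(512, 512)
w_rl = w_pre.clone()
touched = torch.rand_like(w_pre) < 0.05        # pretend RL moved ~5% of weights
w_rl[touched] += 0.01 * torch.randn(int(touched.sum()))
mask = update_mask(w_pre, w_rl)
print(f"update sparsity: {1 - mask.float().mean():.3f}")

# To fine-tune only the subnetwork, zero the gradient off the mask after
# each backward pass: w.grad.mul_(mask) before the optimizer step.
```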
arXiv Detail & Related papers (2025-07-23T01:02:17Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts. Thanks to this artful design, ALoRE adds negligible extra parameters and can be effortlessly merged into the frozen backbone.
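The rank arithmetic behind Kronecker-product adapters is simple to demonstrate: rank(A ⊗ B) = rank(A) · rank(B), so two small dense factors span a far higher-rank update than a same-budget LoRA factor pair. The shapes below are illustrative, not ALoRE's configuration.

```python
# Why a Kronecker-product parameterization yields high rank cheaply.
import torch

A = torch.randn(16, 16)      # 256 params, rank <= 16
B = torch.randn(32, 32)      # 1024 params, rank <= 32
delta_W = torch.kron(A, B)   # 512 x 512 update, rank up to 16 * 32 = 512
print(delta_W.shape, torch.linalg.matrix_rank(delta_W))
```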
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - POAR: Efficient Policy Optimization via Online Abstract State Representation Learning [6.171331561029968]
State Representation Learning (SRL) learns to encode task-relevant features from complex sensory data into low-dimensional states.
We introduce a new SRL prior called domain resemblance, which leverages expert demonstrations to improve SRL interpretations.
We empirically verify that POAR efficiently handles high-dimensional tasks and facilitates training real-life robots directly from scratch.
arXiv Detail & Related papers (2021-09-17T16:52:03Z)