The Invisible Leash: Why RLVR May Not Escape Its Origin
- URL: http://arxiv.org/abs/2507.14843v1
- Date: Sun, 20 Jul 2025 07:04:08 GMT
- Title: The Invisible Leash: Why RLVR May Not Escape Its Origin
- Authors: Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi
- Abstract summary: Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. We identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions.
- Score: 48.915013455847856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely improves precision by amplifying high-reward outputs that the base model already produces. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective: RLVR is constrained by the base model's support (it cannot sample solutions to which the base model assigns zero initial probability) and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs its expansion under larger sampling budgets, and RLVR-tuned models fail to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
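To make the support argument concrete, here is a minimal, hedged sketch (not the paper's code; the candidate answers, probabilities, and rewards are all illustrative). It treats RLVR-style training as multiplicative reweighting of the base policy, so an answer to which the base model assigns zero probability can never gain mass, however favorable the verifier's reward.

```python
# Toy illustration (not the paper's code): RLVR viewed as multiplicative
# reweighting of the base policy cannot create probability mass on answers
# that the base model never generates.
import numpy as np

answers = ["A", "B", "C", "D_unseen"]          # hypothetical candidate answers
p_base  = np.array([0.60, 0.30, 0.10, 0.00])   # base model assigns 0 to D_unseen
reward  = np.array([0.0, 1.0, 1.0, 1.0])       # verifier marks B, C, D_unseen correct

def reweight(p, r, beta=0.1):
    """KL-regularized / exponentiated-reward update: p_new proportional to p * exp(r / beta)."""
    w = p * np.exp(r / beta)
    return w / w.sum()

p_rl = p_base.copy()
for _ in range(50):                            # iterate the conservative reweighting
    p_rl = reweight(p_rl, reward)

for a, pb, pr in zip(answers, p_base, p_rl):
    print(f"{a:9s} base={pb:.2f}  after reweighting={pr:.2f}")
# D_unseen stays at exactly 0.00: zero-probability answers lie outside the
# reachable support, which is the "invisible leash" the abstract describes.
```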
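In the same hedged spirit, the sketch below contrasts the two entropy measures named in the abstract: token-level entropy (mean uncertainty of the next-token distribution at each generation step) and answer-level entropy (uncertainty over the distinct final answers obtained by sampling). The toy distributions and answer strings are assumptions, not the paper's data or protocol.

```python
# Illustrative entropy measures (assumed formulations, not the paper's exact protocol).
import math
from collections import Counter

def token_level_entropy(step_distributions):
    """Mean Shannon entropy of the next-token distribution at each generation step."""
    ents = [-sum(p * math.log(p) for p in dist.values() if p > 0)
            for dist in step_distributions]    # each dist: {token: probability}
    return sum(ents) / len(ents)

def answer_level_entropy(sampled_answers):
    """Shannon entropy of the empirical distribution over distinct final answers."""
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(sampled_answers).values())

# Hypothetical numbers: per-step token distributions can look *more* uncertain
# after RLVR, yet the sampled final answers collapse onto fewer distinct values.
steps_base = [{"7": 0.9, "1": 0.1}, {"+": 0.8, "-": 0.2}]
steps_rlvr = [{"7": 0.6, "1": 0.4}, {"+": 0.5, "-": 0.5}]
base_answers = ["7", "7", "12", "-3", "7", "12", "5", "7"]
rlvr_answers = ["7", "7", "7", "7", "12", "7", "7", "7"]

print("token-level entropy  base vs RLVR:",
      round(token_level_entropy(steps_base), 3), round(token_level_entropy(steps_rlvr), 3))
print("answer-level entropy base vs RLVR:",
      round(answer_level_entropy(base_answers), 3), round(answer_level_entropy(rlvr_answers), 3))
```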
Related papers
- RLPR: Extrapolating RLVR to General Domains without Verifiers [103.14103272635893]
We propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. We find that addressing the high variance of this noisy probability reward is crucial to making it work. RLPR consistently improves reasoning capabilities in both areas for Gemma-, Llama-, and Qwen-based models.
arXiv Detail & Related papers (2025-06-23T02:56:36Z) - Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs [32.99709073885827]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, RLVR-tuned models often underperform their base models on the $Pass@K$ metric for solution-finding. We introduce a more precise evaluation metric, $CoT$-$Pass@K$, which mandates that both the reasoning path and the final answer be correct (a short sketch of this metric appears after the related-papers list below).
arXiv Detail & Related papers (2025-06-17T07:06:56Z) - On the Mechanism of Reasoning Pattern Selection in Reinforcement Learning for Language Models [17.36077163968198]
We present a systematic study of Reinforcement Learning with Verifiable Rewards (RLVR). We show that RLVR-trained models preferentially adopt high-success-rate reasoning patterns. We develop theoretical analyses of the convergence and training dynamics of RLVR.
arXiv Detail & Related papers (2025-06-05T07:17:04Z) - Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [67.30809748319486]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). This study critically examines the current state of RLVR. We find that the current training setup does not elicit fundamentally new reasoning patterns.
arXiv Detail & Related papers (2025-04-18T17:59:56Z) - Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing the mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains, including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome the limitations posed by binary verification.
arXiv Detail & Related papers (2025-03-31T08:22:49Z) - Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning [19.064630697040055]
Reinforcement learning from verifiable rewards (RLVR) has recently gained attention for its ability to elicit self-evolved reasoning from base language models without explicit reasoning supervision. We introduce Med-RLVR as an initial study of RLVR in the medical domain, leveraging medical multiple-choice question answering (MCQA) data as verifiable labels. Our results demonstrate that RLVR is not only effective for math and coding but also extends successfully to medical question answering.
arXiv Detail & Related papers (2025-02-27T00:54:38Z) - Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z) - Principled Knowledge Extrapolation with GANs [92.62635018136476]
We study counterfactual synthesis from a new perspective of knowledge extrapolation.
We show that an adversarial game with a closed-form discriminator can be used to address the knowledge extrapolation problem.
Our method enjoys both elegant theoretical guarantees and superior performance in many scenarios.
arXiv Detail & Related papers (2022-05-21T08:39:42Z)
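As referenced in the $CoT$-$Pass@K$ entry above, a short sketch can make that metric concrete. It builds on the standard unbiased pass@k estimator from the code-generation literature and, per that abstract, counts a sample as a success only when both the final answer and the reasoning path pass a judge. The judging functions here are placeholders, not the paper's actual verifiers.

```python
# Sketch of Pass@K vs CoT-Pass@K (hedged: the correctness judges are placeholders).
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k) for n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def cot_pass_at_k(samples, k, answer_ok, reasoning_ok):
    """CoT-Pass@K as described in the related abstract: a sample counts only if
    both the final answer and the reasoning path are judged correct."""
    n = len(samples)
    c = sum(1 for s in samples if answer_ok(s) and reasoning_ok(s))
    return pass_at_k(n, c, k)

# Hypothetical usage with placeholder judges:
samples = [{"answer": "42", "cot_valid": True},
           {"answer": "42", "cot_valid": False},
           {"answer": "17", "cot_valid": True}]
print(pass_at_k(n=3, c=2, k=2))                      # answer-only pass@2
print(cot_pass_at_k(samples, k=2,
                    answer_ok=lambda s: s["answer"] == "42",
                    reasoning_ok=lambda s: s["cot_valid"]))
```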