Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- URL: http://arxiv.org/abs/2510.18176v1
- Date: Mon, 20 Oct 2025 23:58:31 GMT
- Title: Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- Authors: Soumya Rani Samineni, Durgesh Kalwar, Vardaan Gangal, Siddhant Bhambri, Subbarao Kambhampati
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR)-based post-training of Large Language Models (LLMs) has been shown to improve accuracy on reasoning tasks. We investigate the effect of RL post-training on intermediate tokens which are not directly incentivized.
- Score: 13.626335241662977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR)-based post-training of Large Language Models (LLMs) has been shown to improve accuracy on reasoning tasks and continues to attract significant attention. Existing RLVR methods, however, typically treat all tokens uniformly without accounting for token-level advantages. These methods primarily evaluate performance based on final answer correctness or Pass@K accuracy, and yet make claims about RL post-training leading to improved reasoning traces. This motivates our investigation into the effect of RL post-training on intermediate tokens which are not directly incentivized. To study this, we design an experimental setup using the GRPO algorithm with Qwen-2.5-0.5B model on the GSM8K dataset. We introduce trace coherence, a First-Order Logic (FOL)-based measure to capture the consistency of reasoning steps by identifying errors in the traces. We distinguish between trace validity and trace coherence, noting that the former implies logical soundness while the latter measures local coherence via lack of errors. Our results show that RL post-training overall improves trace coherence with the most significant gains on problems where the base model fails but the RL model succeeds. Surprisingly, RL enhances local coherence without necessarily producing valid or correct solutions. This highlights a crucial distinction: improved local coherence in reasoning steps does not guarantee final answer correctness. We argue that claims of improved reasoning via RL must be examined with care, as these may be based on improved trace coherence, which may not translate into fully valid mathematical proofs.
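The abstract's remark that existing RLVR methods "treat all tokens uniformly" can be illustrated with a minimal sketch of the group-relative advantage usually associated with GRPO. This is not the authors' code. The function names, the example rewards, the use of the population standard deviation, and the `eps` stabilizer are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: population standard deviation plus a small eps;
    real implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Assign every token in a rollout the same sequence-level advantage."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Hypothetical group of four rollouts for one prompt:
# reward 1.0 if the final answer is verified correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 2, 4, 1]  # hypothetical trace lengths in tokens

advantages = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(advantages, token_counts)
```

Under these assumptions, every intermediate token in a rollout receives the same signal as the final answer tokens, so the reward never separates a locally coherent step from an erroneous one within the same trace.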
Related papers
- Learn Hard Problems During RL with Reference Guided Fine-tuning [56.56461712665904]
Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity. We introduce Reference-Guided Fine-Tuning (ReGFT) to synthesize positive trajectories on hard problems and train on them before RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
arXiv Detail & Related papers (2026-03-01T18:41:28Z)
- Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning [59.76691952347156]
Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs). Most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions. We propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL.
arXiv Detail & Related papers (2026-01-26T21:38:20Z)
- Cog-Rethinker: Hierarchical Metacognitive Reinforcement Learning for LLM Reasoning [14.57256913655025]
We propose Cog-Rethinker, a novel hierarchical metacognitive RL framework for LLM reasoning. Our Cog-Rethinker mainly focuses on the rollout procedure in RL training. By leveraging human cognition during solving problems, it prompts policy to decompose zero-accuracy problems into subproblems.
arXiv Detail & Related papers (2025-10-13T08:16:21Z)
- ConfClip: Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs [32.13266235550995]
Reinforcement learning (RL) has become a standard paradigm for refining large language models (LLMs). Inspired by observations from human learning, we introduce an RL technique that integrates verifiable outcomes with the model's own confidence estimates.
arXiv Detail & Related papers (2025-09-22T13:00:35Z)
- Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL [19.659532349434418]
Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models. Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness. We propose Dynamic Reasoning Efficiency Reward (DRER) -- a plug-and-play RL reward framework that reshapes both reward and advantage signals.
arXiv Detail & Related papers (2025-09-07T11:52:18Z)
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty [59.97939500426759]
This paper describes RLCR, an approach to training reasoning models that jointly improves accuracy and confidence estimation. We show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy. We also demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration.
arXiv Detail & Related papers (2025-07-22T17:56:01Z)
- RLPR: Extrapolating RLVR to General Domains without Verifiers [103.14103272635893]
We propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. We find that addressing the high variance of this noisy probability reward is crucial to make it work. RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models.
arXiv Detail & Related papers (2025-06-23T02:56:36Z)
- Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs [35.27561531876348]
This paper systematically investigates the impact of Reinforcement Learning with Verifiable Rewards (RLVR) on Large Language Models (LLMs). We show that RLVR can extend the reasoning boundary for both mathematical and coding tasks. We present a theoretical framework explaining RLVR's incentive mechanism, demonstrating how it can encourage correct reasoning even when rewards are based solely on answer correctness.
arXiv Detail & Related papers (2025-06-17T07:06:56Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.