LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
- URL: http://arxiv.org/abs/2603.02146v1
- Date: Mon, 02 Mar 2026 18:07:53 GMT
- Title: LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
- Authors: Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing
- Abstract summary: We introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks.
- Score: 51.45138356629732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding--the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model for identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
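The reward structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the F-beta overlap over chunk ids, the exact-match answer check, the additive combination, and the names `context_reward`, `answer_reward`, `combined_reward`, and `weight` are all assumptions made for illustration. The abstract only states that a dense, verifiable context reward is added to the sparse answer reward.

```python
from typing import Iterable


def context_reward(selected: Iterable[int], gold: Iterable[int], beta: float = 1.0) -> float:
    """Hypothetical dense reward: F-beta overlap between the chunk ids the
    model selects as evidence and the gold evidence chunk ids."""
    selected_set, gold_set = set(selected), set(gold)
    hits = len(selected_set & gold_set)
    if hits == 0:
        # Covers empty selections and selections with no correct chunk.
        return 0.0
    precision = hits / len(selected_set)
    recall = hits / len(gold_set)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def answer_reward(predicted: str, reference: str) -> float:
    """Sparse outcome reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def combined_reward(
    predicted: str,
    reference: str,
    selected: Iterable[int],
    gold: Iterable[int],
    weight: float = 0.5,  # assumed mixing coefficient, not taken from the paper
) -> float:
    """Answer reward plus a weighted context reward (assumed additive form)."""
    return answer_reward(predicted, reference) + weight * context_reward(selected, gold)


if __name__ == "__main__":
    # Wrong answer, but the right evidence was selected: the reward is still nonzero.
    print(combined_reward("Rome", "Paris", [2, 3], [2, 3]))
    # Right answer with partially correct evidence.
    print(combined_reward("Paris", "paris", [1, 2], [2, 3]))
```

Under these assumptions, a rollout that retrieves the correct evidence but answers incorrectly still receives a learning signal, which is the intuition behind pairing the sparse outcome reward with a dense grounding reward.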
Related papers
- ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL [64.77036363086519]
We propose ContextRL, a novel framework that leverages context augmentation to overcome these bottlenecks. We provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives. We also introduce a multi-turn sampling strategy where the reward model generates mistake reports for failed attempts, guiding the policy to "recover" correct responses from previously all-negative groups.
arXiv Detail & Related papers (2026-02-26T04:55:57Z)
- Beyond Correctness: Learning Robust Reasoning via Transfer [51.403609251508904]
We adopt a simple philosophical view: robust reasoning should remain useful beyond the mind that produced it. We introduce Reinforcement Learning with Transferable Reward, which operationalizes robustness via transfer reward. Our approach improves sampling consistency while improving final answer accuracy, and it reaches comparable performance in substantially fewer training steps.
arXiv Detail & Related papers (2026-02-09T10:41:44Z)
- Document Reconstruction Unlocks Scalable Long-Context RLVR [60.74632963522131]
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm to enhance the capabilities (i.e. long-context) of Large Language Models (LLMs). We investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2.
arXiv Detail & Related papers (2026-02-09T03:23:23Z)
- Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning [82.91265691530351]
A$^2$D is an Adaptive Ability Decomposing method for enhancing the effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR). We first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance.
arXiv Detail & Related papers (2026-01-31T14:48:23Z)
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z)
- Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning [3.437656066916039]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities. We investigate RLVR on two problems with fully verifiable solutions. We find that RLVR improves evaluation metrics but often by reinforcing superficial learning metrics rather than acquiring new reasoning strategies.
arXiv Detail & Related papers (2025-10-30T23:16:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.