Fugu-MT Paper Translation (Abstract): LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

Paper abstract: LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

arxiv url: http://arxiv.org/abs/2603.02146v1
Date: Mon, 02 Mar 2026 18:07:53 GMT
Status: Translation complete
System update date: 2026-03-03 19:50:57.022122
Title: LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
Title (reference translation): LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
Authors: Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing
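Abstract summary: We introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks.
Reference score (independently calculated attention score): 51.45138356629732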
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding--the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model for identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
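Abstract (reference translation): Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm fails in long-context scenarios, because its reliance on internal parametric knowledge is ill-suited to tasks that require contextual grounding, that is, the ability to find and reason over externally provided information. A reward based solely on the final answer is too sparse to guide the model effectively toward identifying the relevant evidence. We formally prove that an outcome-only reward leads to significant vanishing gradients for the context grounding process, making learning intractable. To overcome this bottleneck, we introduce LongRLVR, which augments the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information and provides a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. For example, scores on RULER-QA improve from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is an important and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.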

Related papers