Fugu-MT Paper Translation (Summary): Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

Paper summary: Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

arxiv url: http://arxiv.org/abs/2509.23808v1
Date: Sun, 28 Sep 2025 11:14:58 GMT
Status: translation complete
Last updated in system: 2025-09-30 22:32:19.46154
Title: Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
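Title (reference translation): A Hidden State Approach for LLM Reasoning in RLVR
Authors: Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang
Abstract summary: A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off. We re-examine this perspective and suggest that the perceived trade-off may be an artifact of the measurement level rather than a fundamental constraint. We propose Velocity-Exploiting Rank-Learning (VERL), the first method to operationalize the principle of synergistic exploration-exploitation enhancement.
Reference score (attention score computed by this site): 15.147456927849932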
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation could be decoupled (Sec. 4). This finding reveals an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
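Abstract (reference translation): A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective shaped largely by token-level metrics. We re-examine this perspective and suggest that the perceived trade-off may be an artifact of the measurement level rather than a fundamental constraint. To investigate this, we move the analysis to the hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its first- and second-order derivatives, Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. The analysis shows that, at the hidden-state level, exploration and exploitation can be decoupled (Sec. 4). This finding reveals an opportunity to strengthen both capacities at the same time. It motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is using the theoretically stable ERA as a predictive meta-controller to build a synergistic, dual-channel incentive structure. Rather than forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.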
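
Illustrative sketch (not from the paper): the abstract names Effective Rank (ER) and its first- and second-order derivatives (ERV, ERA) but does not give formulas. The Python below assumes the common entropy-based definition of effective rank, the exponential of the Shannon entropy of the normalized singular values, and uses plain finite differences over a sequence of ER values as hypothetical stand-ins for ERV and ERA. The function names `effective_rank` and `finite_differences`, the `eps` parameter, and the choice of what the ER sequence is indexed over are all assumptions made for illustration; the paper's actual definitions may differ.

```python
import numpy as np


def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (tokens x features) matrix.

    Assumption: ER = exp(H(p)), where p is the singular-value
    distribution normalized to sum to 1.
    """
    singular_values = np.linalg.svd(hidden_states, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries so log(p) is defined
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))


def finite_differences(er_values):
    """Hypothetical stand-ins for ERV and ERA.

    Returns the first and second discrete differences of a sequence
    of ER values; this is an assumption, not the paper's formula.
    """
    er = np.asarray(er_values, dtype=float)
    velocity = np.diff(er, n=1)      # first-order change in ER
    acceleration = np.diff(er, n=2)  # second-order change in ER
    return velocity, acceleration


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = rng.normal(size=(32, 16))  # synthetic "hidden states"
    # ER of growing prefixes of the synthetic sequence
    er_trace = [effective_rank(states[:t]) for t in range(2, 33, 6)]
    erv, era = finite_differences(er_trace)
    print(er_trace, erv, era)
```

Under this definition a matrix whose singular values are all equal has an effective rank equal to its number of nonzero singular values, while a rank-1 matrix has an effective rank of about 1, which matches the reading of ER as a measure of how many directions the hidden states actually use.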

Related papers