Fugu-MT Paper Translation (Abstract): Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

Paper overview: Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

arxiv url: http://arxiv.org/abs/2604.11056v1
Date: Mon, 13 Apr 2026 06:32:49 GMT
Status: Translation complete
System update date: 2026-04-14 20:13:16.364581
Title: Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis
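Title (reference translation): Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis
Authors: Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang
Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning ability of Large Language Models (LLMs). We analyze the resulting credit assignment problem through the joint lens of reward polarity and token entropy. We propose Entropy-Aware Policy Optimization (EAPO), which modulates token-level learning signals.
Reference score (independently computed attention score): 33.07421874137999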
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning ability of Large Language Models (LLMs). However, its sparse outcome-based rewards pose a fundamental credit assignment problem. We analyze this problem through the joint lens of reward polarity and token entropy. Our diagnostic tool, the Four Quadrant Decomposition, isolates token updates by polarity and entropy, and controlled ablations show that reasoning improvements concentrate in the high-entropy quadrants. To justify this observation theoretically, we adapt Conditional Mutual Information to the autoregressive RLVR setting and prove that the credit a token can carry is upper-bounded by its entropy. This view yields testable predictions that reasoning gains arise primarily from high-entropy tokens, with unique roles for positive and negative updates. A gradient analysis of GRPO further reveals how uniform reward broadcast dilutes signal at high-entropy positions while over-crediting deterministic tokens. Grounded in these insights, we propose Entropy-Aware Policy Optimization (EAPO) that modulates token-level learning signals accordingly. Extensive experiments demonstrate that EAPO outperforms strong baselines across two model families.
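The entropy bound mentioned in the abstract is consistent with a standard information-theoretic inequality. The following is only a sketch in assumed notation, not the paper's own theorem or symbols: $y_t$ is the token at position $t$, $s_t$ is the prefix it is conditioned on, and $R$ is the verifiable outcome reward.

```latex
I(y_t; R \mid s_t)
  = H(y_t \mid s_t) - H(y_t \mid R, s_t)
  \le H(y_t \mid s_t)
```

The inequality holds because the conditional entropy of a discrete variable is non-negative. Under this reading, a near-deterministic token (entropy close to zero) can share almost no information with the reward, which matches the abstract's claim that credit concentrates on high-entropy tokens.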
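Abstract (reference translation): Reinforcement Learning with Verifiable Rewards (RLVR) has greatly improved the reasoning ability of Large Language Models (LLMs). However, its sparse, outcome-based rewards give rise to a fundamental credit assignment problem. We analyze this problem through the joint lens of reward polarity and token entropy. Our diagnostic tool, the Four Quadrant Decomposition, separates token updates by polarity and entropy, and controlled ablations show that reasoning improvements are concentrated in the high-entropy quadrants. To justify this observation theoretically, we adapt Conditional Mutual Information to the autoregressive RLVR setting and prove that the credit a token can carry is upper-bounded by its entropy. This view yields the testable prediction that reasoning gains arise mainly from high-entropy tokens, with positive and negative updates playing distinct roles. A gradient analysis of GRPO further shows how broadcasting the reward uniformly dilutes the signal at high-entropy positions while over-crediting deterministic tokens. Based on these findings, we propose Entropy-Aware Policy Optimization (EAPO), which modulates token-level learning signals accordingly. Extensive experiments show that EAPO outperforms strong baselines on two model families.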
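The Four Quadrant Decomposition described in the abstract splits token updates by reward polarity and token entropy. The following is a minimal sketch of that idea, not the authors' implementation: the function names `token_entropy` and `four_quadrant_masks`, the fixed `threshold` cutoff, and the use of the advantage sign as the polarity are all assumptions made for illustration.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits: array of shape (T, V), one row of vocabulary logits per token.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def four_quadrant_masks(entropy, advantage, threshold):
    """Boolean masks over tokens, one per (polarity, entropy level) quadrant.

    entropy:   per-token entropies, shape (T,)
    advantage: sequence-level advantage, broadcast to every token
               (assumed here to carry the reward polarity in its sign)
    threshold: hypothetical cutoff between "high" and "low" entropy
    """
    adv = np.broadcast_to(np.asarray(advantage, dtype=float), entropy.shape)
    high = entropy >= threshold
    positive = adv > 0
    negative = adv < 0
    return {
        "pos_high": positive & high,
        "pos_low": positive & ~high,
        "neg_high": negative & high,
        "neg_low": negative & ~high,
    }

# Toy example: one uniform (high-entropy) token and one peaked (low-entropy) token.
logits = np.array([[0.0, 0.0, 0.0, 0.0],
                   [10.0, 0.0, 0.0, 0.0]])
entropy = token_entropy(logits)
masks = four_quadrant_masks(entropy, advantage=1.0, threshold=0.5)
for name, mask in masks.items():
    print(name, mask.tolist())
```

In an ablation of the kind the abstract describes, one such mask at a time would be multiplied into the per-token loss so that only a single quadrant contributes gradient. How the paper chooses the entropy cutoff is not stated in the abstract, so the fixed threshold above is only a placeholder.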

Related papers