Fugu-MT Paper Translation (Abstract): GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

Paper summary: GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

arxiv url: http://arxiv.org/abs/2606.04889v1
Date: Wed, 03 Jun 2026 13:51:27 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.795687
Title: GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards
Title (reference translation): GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards
Authors: Tej Deep Pala, Vernon Toh, Soujanya Poria,
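Abstract summary: Gradient-Reweighted Advantage (GRAIL) is an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. GRAIL improved accuracy by 3.60% on average and Pass@3 by 3.05%.
Reference score (independently computed attention score): 36.68876802708284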
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL achieved an average improvement of 3.60% in accuracy and 3.05% in Pass@3, demonstrating that fine-grained reasoning alignment can be achieved without process-level supervision.
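Abstract (reference translation): Reinforcement learning with verifiable rewards (e.g. GRPO) is a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually either broadcast a single sequence-level advantage to all tokens or rely on costly process reward models (PRMs) for step-level supervision. A uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, because flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, the paper introduces Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL improved accuracy by 3.60% on average and Pass@3 by 3.05%.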
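The contrast the abstract draws, one sequence-level advantage broadcast to every token versus a token-wise reweighted advantage, can be illustrated with a minimal sketch. This is not the paper's formula: the abstract does not give one. The function names, the mean-1 normalization of the weights, the use of population standard deviation, and the saliency values themselves are all assumptions made for illustration. In GRAIL the saliency would come from gradient-activation signals, which this sketch does not compute.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style sequence-level advantages: standardize rewards within one group.

    Population standard deviation is used here as a simplification.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast(advantage, num_tokens):
    """Uniform baseline: every token receives the same sequence-level advantage."""
    return [advantage] * num_tokens


def reweight(advantage, saliency):
    """Hypothetical token-wise reweighting (an assumption, not the paper's rule).

    Saliency is normalized to mean 1, so the average per-token advantage of the
    sequence stays equal to the sequence-level advantage.
    """
    total = sum(saliency)
    if total <= 0:
        # No usable saliency signal: fall back to the uniform broadcast.
        return broadcast(advantage, len(saliency))
    n = len(saliency)
    return [advantage * s * n / total for s in saliency]


# One group of four sampled answers with verifiable 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_advantages(rewards)

# Made-up saliency scores for a four-token response (placeholder values).
saliency = [0.1, 0.1, 0.6, 0.2]

uniform = broadcast(advs[0], len(saliency))
weighted = reweight(advs[0], saliency)

for u, w in zip(uniform, weighted):
    print(f"uniform={u:.3f}  reweighted={w:.3f}")
```

Under these assumptions the third token, which has the highest placeholder saliency, receives a larger share of the advantage than under uniform broadcasting, while low-saliency tokens receive less.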

Related papers