Fugu-MT Paper Translation (Abstract): GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

Paper overview: GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

arxiv url: http://arxiv.org/abs/2605.11853v2
Date: Thu, 14 May 2026 10:19:32 GMT
Status: Translation complete
System update date: 2026-05-15 15:19:49.89362
Title: GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Title (reference translation): GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Authors: Sijia Li, Yuchen Huang, Zifan Liu, Yanping Li, Jingjing Fu, Li Zhao, Jiang Bian, Ling Zhang, Jun Zhang, Rui Wang
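Abstract summary: Granularity-adaptivE Advantage Reweighting (GEAR) uses token-level and segment-level signals to reshape the trajectory-level GRPO advantage. GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods.
Reference score (independently computed attention score): 33.370957547486775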
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
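Abstract (reference translation): Reinforcement learning is widely used as a post-training approach for LLM agents, and training usually relies on outcome-level rewards that provide only coarse supervision. Finer-grained credit assignment promises more effective policy updates, but obtaining reliable local credit and assigning it to the right parts of a long-horizon trajectory remains an open challenge. This paper proposes Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token-level and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal, which is used to identify adaptive segment boundaries and to modulate local advantage weights. This divergence often spikes at the onset of a semantic deviation, whereas later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats these spikes as anchors for adaptive credit regions: where the student stays aligned with the teacher, token-level resolution is preserved; where the student departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments on eight mathematical reasoning and agentic tool-use benchmarks with the Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token-level or turn-level credit-assignment methods. The gains are especially large on benchmarks with low GRPO baseline accuracy, reaching up to around 20% over GRPO, which suggests that the proposed adaptive reweighting scheme is particularly useful in more challenging long-horizon settings.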
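
Illustrative sketch (not the authors' implementation): the abstract describes reshaping a trajectory-level GRPO advantage with a per-token divergence signal, where a divergence spike anchors a segment and aligned tokens keep token-level resolution. The Python below is one minimal way this could look. Everything beyond that description is an assumption made for illustration: the function names, the spike `threshold`, the rule that a segment runs until the next spike or the end of the trajectory, and the weight form `1 + alpha * divergence`. The per-token divergences between student and teacher are taken as given inputs. The group normalization in `grpo_advantages` follows the usual GRPO formulation (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Trajectory-level advantages: rewards normalized within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this detail
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchor_indices(divergence, threshold):
    """Map every token to the index whose divergence sets its weight.

    ASSUMPTION (not stated in the abstract): a token with divergence >= threshold
    is a spike that opens a segment, and that segment runs until the next spike
    or the end of the trajectory. Tokens before the first spike anchor to
    themselves, which keeps token-level resolution where the student is aligned.
    """
    anchors = []
    current_spike = None
    for t, d in enumerate(divergence):
        if d >= threshold:
            current_spike = t  # departure point: start a new segment here
        anchors.append(t if current_spike is None else current_spike)
    return anchors


def reweight_advantage(advantage, divergence, threshold=1.0, alpha=0.5):
    """Spread one trajectory-level advantage over tokens with local weights.

    ASSUMPTION: the weight is 1 + alpha * divergence[anchor]; the abstract only
    says the divergence at the departure point modulates the segment's advantage.
    """
    anchors = anchor_indices(divergence, threshold)
    return [advantage * (1.0 + alpha * divergence[a]) for a in anchors]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]          # outcome rewards for one group
    advantages = grpo_advantages(rewards)
    divergence = [0.1, 0.0, 2.0, 0.2, 0.1]  # hypothetical per-token divergences
    print(anchor_indices(divergence, threshold=1.0))
    print(reweight_advantage(advantages[0], divergence))
```

In this toy input the third token is the only spike, so the first two tokens are weighted by their own divergence while the last three share the weight set at the departure point, even though their own divergence has dropped back to low values.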

Related papers