Fugu-MT paper translation (abstract): TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

Paper overview: TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

arxiv url: http://arxiv.org/abs/2605.10194v1
Date: Mon, 11 May 2026 08:45:03 GMT
Status: Translation complete
System update date: 2026-05-12 23:28:50.660752
Title: TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
Authors: Jiaxuan Wang, Xuan Ouyang, Zhiyu Chen, Yulan Hu, Zheng Pan, Xin Li, Lan-Zhe Guo
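Abstract summary: On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. This paper proposes Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans. The analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite.
Reference score (independently computed attention measure): 20.277178104190536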
License: http://creativecommons.org/licenses/by/4.0/
Abstract: On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average and preserves the Qwen3-8B base OOD score on GPQA-Diamond, where GRPO and all-token self-OPD baselines degrade. Gains persist under online self-annotation (+1.90 percentage points, about 69% of the strong-API gain), reducing the concern that TRACE merely imports external annotator capability. Across scales, the best routed action is base-dependent: on Qwen3-8B it is forward KL on key spans, while on Qwen3-1.7B it shifts to reverse KL on error spans.
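The routing described in the abstract (forward KL on key spans of correct rollouts, optional reverse KL on error spans, GRPO elsewhere, with the KL channel annealed after warm-up) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear decay schedule, the route labels, and the choice to fall back to the GRPO loss once the KL weight reaches zero are all assumptions made for illustration. The per-token GRPO losses are taken as precomputed inputs.

```python
import math

EPS = 1e-12  # guards against log(0); an illustrative choice

def forward_kl(teacher, student):
    """KL(teacher || student) for one token's distribution over the vocabulary."""
    return sum(p * math.log(p / max(q, EPS))
               for p, q in zip(teacher, student) if p > 0)

def reverse_kl(teacher, student):
    """KL(student || teacher) for one token's distribution over the vocabulary."""
    return sum(q * math.log(q / max(p, EPS))
               for p, q in zip(teacher, student) if q > 0)

def kl_weight(step, warmup_steps, decay_steps, base=1.0):
    """Hypothetical schedule: constant during warm-up, then linear decay to zero."""
    if step < warmup_steps:
        return base
    remaining = 1.0 - (step - warmup_steps) / decay_steps
    return base * max(0.0, remaining)

# Hypothetical route labels an annotator might assign to each token.
KEY, ERROR, OTHER = "key", "error", "other"

def routed_token_losses(routes, teacher_dists, student_dists, grpo_losses,
                        is_correct, step, warmup_steps=100, decay_steps=100):
    """Return one loss value per token, chosen by the token's route.

    Assumption: once the KL weight has decayed to zero, every token
    falls back to its GRPO loss.
    """
    w = kl_weight(step, warmup_steps, decay_steps)
    losses = []
    for route, t, s, pg in zip(routes, teacher_dists, student_dists, grpo_losses):
        if w > 0 and route == KEY and is_correct:
            losses.append(w * forward_kl(t, s))   # key span of a correct rollout
        elif w > 0 and route == ERROR:
            losses.append(w * reverse_kl(t, s))   # localized error span
        else:
            losses.append(pg)                     # all remaining tokens
    return losses
```

In this sketch, only tokens inside annotated spans ever see the privileged teacher distribution, and only while the KL weight is nonzero, which mirrors the abstract's point that masking and decay bound the cumulative exposure to privileged gradients.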

Related papers