Fugu-MT paper translation (abstract): TIP: Token Importance in On-Policy Distillation

Paper overview: TIP: Token Importance in On-Policy Distillation

arxiv url: http://arxiv.org/abs/2604.14084v2
Date: Sun, 19 Apr 2026 02:47:58 GMT
Status: translation complete
Last updated in system: 2026-04-21 13:51:31.115746
Title: TIP: Token Importance in On-Policy Distillation
Title (reference translation): TIP: Token Importance in On-Policy Distillation
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
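Abstract summary: Informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence. TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, organizes these findings. The picture is validated across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning.
Reference score (independently computed attention score): 20.04756350098974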
License: http://creativecommons.org/licenses/by/4.0/
Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<20\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
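Abstract (reference translation): On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions are equally important, but existing views of token importance are incomplete. We ask directly: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions where student entropy is high, and positions where student entropy is low but teacher-student divergence is high, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When low-entropy, high-divergence tokens are isolated, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal even though they are nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and explain theoretically why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on fewer than 20% of tokens surpasses full-token OPD. The experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.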
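
The two regions described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the fixed thresholds, and the choice of KL(student || teacher) as the divergence are all assumptions made for illustration, since the abstract does not specify them. The sketch computes per-position student entropy and teacher-student divergence from logits, then keeps positions that are either high-entropy or low-entropy with high divergence.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_stats(student_logits, teacher_logits):
    """Per-position student entropy and KL(student || teacher).

    The divergence direction is an assumption; the abstract only says
    "teacher-student divergence".
    """
    log_p = log_softmax(student_logits)   # student log-probabilities
    log_q = log_softmax(teacher_logits)   # teacher log-probabilities
    p = np.exp(log_p)
    entropy = -(p * log_p).sum(axis=-1)
    divergence = (p * (log_p - log_q)).sum(axis=-1)
    return entropy, divergence

def select_tokens(entropy, divergence, entropy_threshold, divergence_threshold):
    """Boolean masks over positions (threshold values are hypothetical).

    high_entropy:  the student is uncertain.
    overconfident: the student is confident but disagrees with the teacher.
    """
    high_entropy = entropy >= entropy_threshold
    overconfident = (~high_entropy) & (divergence >= divergence_threshold)
    return high_entropy | overconfident, high_entropy, overconfident

# Three positions over a toy 3-token vocabulary.
student = np.array([[0.0, 0.0, 0.0],     # uncertain student
                    [10.0, 0.0, 0.0],    # confident, disagrees with teacher
                    [10.0, 0.0, 0.0]])   # confident, agrees with teacher
teacher = np.array([[0.0, 0.0, 0.0],
                    [0.0, 10.0, 0.0],
                    [10.0, 0.0, 0.0]])

entropy, divergence = token_stats(student, teacher)
mask, high_entropy, overconfident = select_tokens(
    entropy, divergence, entropy_threshold=0.5, divergence_threshold=1.0)
print(mask)  # only the selected positions would contribute to the loss
```

In this toy case the second position has near-zero entropy, so an entropy-only rule would drop it, while the divergence axis picks it up. In practice the thresholds would more likely be set from quantiles of the batch than fixed by hand; that too is an assumption rather than something the abstract states.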

Related papers