Fugu-MT Paper Translation (Abstract): Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

Paper overview: Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

arxiv url: http://arxiv.org/abs/2605.09253v1
Date: Sun, 10 May 2026 01:41:43 GMT
Status: Translation complete
System update date: 2026-05-12 23:28:50.145431
Title: Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
Title (reference translation): Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
Authors: Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang,
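Abstract summary: This work studies high-loss tokens, a token type, as the most direct signal of student-teacher mismatch under the KL objective of On-Policy Distillation. These tokens provide negligible functional contribution to the model's actual reasoning performance.
Reference score (independently computed attention score): 4.624042537090342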
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that, as the most direct signal of student-teacher mismatch under OPD's per-token KL objective, should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these "stumbling blocks" can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.
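Abstract (reference translation): Recent work on Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, but a comparable token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. This work studies high-loss tokens, a token type that is the most direct signal of student-teacher mismatch and that existing studies expect to diminish gradually as training converges; the empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to show persistently high loss. These tokens, called Rock Tokens, can account for up to 18% of the tokens in generated outputs. The investigation reveals two startling paradoxes. First, although their high frequency gives them a disproportionately large share of the total gradient norm, Rock Tokens themselves resist teacher-driven correction and stay stagnant throughout training. Second, causal intervention shows that these tokens contribute negligibly to the model's actual reasoning performance. These results suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, the authors show that strategically bypassing these "stumbling blocks" can significantly streamline the alignment process, which challenges the necessity of uniform token weighting and offers a more efficient paradigm for large-scale model distillation.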
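
The idea in the abstract (a per-token KL loss, a subset of tokens whose loss stays high, and bypassing those tokens instead of weighting all tokens uniformly) can be illustrated with a small sketch. This is not the authors' implementation. The reverse KL direction, the fixed `threshold`, the "high at every checkpoint" rule, and the function names `per_token_kl`, `rock_mask`, and `masked_mean` are all assumptions made for illustration; the abstract does not specify how Rock Tokens are detected or skipped.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized row."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each sequence position.

    Assumption: reverse KL; the abstract only says "per-token KL objective".
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_s, p_t = softmax(s_row), softmax(t_row)
        losses.append(
            sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t) if ps > 0)
        )
    return losses


def rock_mask(loss_history, threshold):
    """Flag positions whose loss exceeds `threshold` at every checkpoint.

    Assumption: `loss_history` holds per-token losses for the same fixed
    sequence evaluated at several checkpoints (one inner list each).
    """
    return [all(l > threshold for l in column) for column in zip(*loss_history)]


def masked_mean(losses, mask):
    """Average the loss over positions that are not flagged."""
    kept = [l for l, flagged in zip(losses, mask) if not flagged]
    return sum(kept) / len(kept) if kept else 0.0


# Toy two-token sequence with a two-word vocabulary.
student = [[2.0, 0.0], [0.0, 0.0]]
teacher = [[2.0, 0.0], [3.0, 0.0]]
losses = per_token_kl(student, teacher)

# Hypothetical loss values recorded at two checkpoints.
mask = rock_mask([[0.9, 0.1], [0.8, 0.05]], threshold=0.5)
reduced_loss = masked_mean([0.9, 0.1], mask)
```

In this sketch a flagged position simply drops out of the averaged loss, which is one straightforward way to stop spending gradient on tokens that do not move; a soft down-weighting would be another option under the same assumptions.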

Related papers