Fugu-MT Paper Translation (Abstract): Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

Paper abstract: Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

arxiv url: http://arxiv.org/abs/2605.13643v1
Date: Wed, 13 May 2026 15:05:30 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:28.125777
Title: Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Title (reference translation): Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Authors: Kaiyuan Liu, Ziyuan Zhuang, Yang Bai, Bing Wang, Rongxiang Weng, Jieping Ye
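Abstract summary: On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. The authors operationalize the principle of concentrating supervision where the teacher's feedback remains discriminative through a trajectory-specific release rule. Experimental results on strong-to-weak distillation tasks indicate that this release rule consistently outperforms standard full-trajectory OPD.
Reference score (independently computed attention score): 49.117085054884676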
License: http://creativecommons.org/licenses/by/4.0/
Abstract: On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.
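Abstract (reference translation): On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, when teacher feedback is available, supervising the full sequence of response tokens monotonically improves performance. However, we show that this assumption sometimes fails to hold in strong-to-weak OPD settings. Later segments of a trajectory may still show a non-zero teacher-student advantage, but they often lack the local contrast that makes dense feedback effective for prioritizing student learning. We call this failure mode local teachability collapse. Supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidates, aggregates this margin over NLTK-tokenized sentence segments, and cuts off dense OPD supervision upon detecting a BIC-style downward change point. Experimental results on strong-to-weak distillation tasks using the Qwen3 model family show that this release rule outperforms standard full-trajectory OPD on five in-domain benchmarks at various student scales. Furthermore, compared with baseline distillation methods, this approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility.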
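
The release rule described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the exact margin formula or change-point criterion, so the margin definition used here (teacher's best log-probability among the student's top-K candidates minus the teacher's mean over those candidates), the Gaussian two-segment BIC comparison, and the function names `topk_margin`, `segment_means`, `bic_downward_change_point`, and `release_mask` are all hypothetical. Sentence segments are passed in as token spans; the paper obtains them with NLTK sentence tokenization, which is left out here to keep the sketch self-contained.

```python
import numpy as np


def topk_margin(teacher_logp, student_logp, k):
    """Per-token teacher margin over the student's top-k candidates.

    ASSUMED definition: among the k tokens the student ranks highest,
    the gap between the teacher's best and mean log-probability.
    Both inputs have shape (T, V).
    """
    top_idx = np.argsort(-student_logp, axis=1)[:, :k]
    t_on_cands = np.take_along_axis(teacher_logp, top_idx, axis=1)
    return t_on_cands.max(axis=1) - t_on_cands.mean(axis=1)


def segment_means(values, boundaries):
    """Average per-token values over (start, end) token spans."""
    return np.array([values[s:e].mean() for s, e in boundaries])


def bic_downward_change_point(x, eps=1e-12):
    """Index c where splitting x into x[:c] and x[c:] with a lower mean
    on the right beats a single-mean model by BIC; None if no split wins.

    ASSUMED criterion: Gaussian BIC, n * log(RSS / n) + p * log(n).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n < 4:
        return None

    def rss(seg):
        return float(((seg - seg.mean()) ** 2).sum())

    # Single-mean model: 2 parameters (mean, variance).
    best_c, best_bic = None, n * np.log(rss(x) / n + eps) + 2 * np.log(n)
    for c in range(2, n - 1):  # keep at least 2 segments on each side
        left, right = x[:c], x[c:]
        if right.mean() >= left.mean():
            continue  # only downward shifts count
        # Split model: 4 parameters (two means, variance, location).
        bic = n * np.log((rss(left) + rss(right)) / n + eps) + 4 * np.log(n)
        if bic < best_bic:
            best_c, best_bic = c, bic
    return best_c


def release_mask(teacher_logp, student_logp, boundaries, k=5):
    """Boolean mask over tokens: True where dense supervision is kept."""
    margins = topk_margin(teacher_logp, student_logp, k)
    c = bic_downward_change_point(segment_means(margins, boundaries))
    mask = np.ones(len(margins), dtype=bool)
    if c is not None:
        mask[boundaries[c][0]:] = False  # release the suffix
    return mask
```

In a training loop, a mask like this would multiply the per-token distillation loss so that only the prefix before the detected change point contributes; when no downward change point is found, the whole trajectory stays supervised.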

Related papers