Fugu-MT paper translation (abstract): When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

Paper overview: When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

arxiv url: http://arxiv.org/abs/2606.03532v1
Date: Tue, 02 Jun 2026 11:54:39 GMT
Status: translation complete
Last updated in system: 2026-06-03 22:00:04.979299
Title: When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation
Title (reference translation): When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation
Authors: Haowei Guo, Baolong Bi, Ruicheng Zhang, Bingqian Sun, Wentao Zhang,
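Abstract summary: This work shows that isolation periods, defined as complete teacher freezing between updates, rather than teacher age, are the key structural property enabling stable learning. The authors propose Consolidation-Gated Teacher Refresh (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety.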
Reference score (independently computed attention score): 11.653794727366957
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule -- which governs the temporal coupling between teacher and student -- has not been systematically studied as a stability variable. Through a controlled schedule sweep on Qwen3-8B, we establish that isolation periods, defined as complete teacher freezing between updates, are the key structural property enabling stable learning, not teacher age. To characterize these underlying training dynamics, we introduce a diagnostic framework of temporal KL structure, refresh shock, and length-tail risk. This framework further uncovers state-oblivious collapse: optimal short-horizon fixed schedules catastrophically fail under long-horizon training because a clock-driven refresh can copy a transiently drifting student into the teacher in a single, irreversible step. This failure mode is invisible under short-horizon evaluation and mechanistically distinct from EMA's chronic contamination. To address this, we propose Consolidation-Gated Teacher Refresh (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety, ensuring every teacher movement responds to genuine student consolidation rather than a clock signal. With a single shared parameter set and no per-dataset retuning, CGTR achieves zero collapse and the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse), self-regulating its refresh frequency to each task's learning dynamics.
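Abstract (reference translation): Self on-policy distillation trains a student policy against a teacher derived from the student's own parameter history, but the teacher's update schedule, which governs the temporal coupling between teacher and student, has not been systematically studied as a stability variable. A controlled schedule sweep on Qwen3-8B establishes that isolation periods, defined as complete teacher freezing between updates, rather than teacher age, are the key structural property enabling stable learning. To characterize these training dynamics, the authors introduce a diagnostic framework of temporal KL structure, refresh shock, and length-tail risk. The framework further reveals state-oblivious collapse: fixed schedules that are optimal over short horizons fail catastrophically under long-horizon training, because a clock-driven refresh can copy a transiently drifting student into the teacher in a single, irreversible step. This failure mode is invisible under short-horizon evaluation and is mechanistically distinct from EMA's chronic contamination. To address it, the authors propose Consolidation-Gated Teacher Refresh (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety, so that every teacher movement responds to genuine student consolidation rather than a clock signal. With a single shared parameter set and no per-dataset retuning, CGTR achieves zero collapse and the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse), self-regulating its refresh frequency to each task's learning dynamics.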
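
The abstract describes the gating idea only at a high level, so the following is a minimal Python sketch of what a consolidation-gated refresh decision might look like, not the authors' implementation. Every name here (`GateConfig`, `ConsolidationGate`, `length_quantile`), the windowed-mean reward test, the quantile-based length test, and all threshold values are assumptions made for illustration. The only elements taken from the abstract are that the teacher stays fully frozen between refreshes and that a refresh requires joint evidence of reward improvement and length-tail safety.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


def length_quantile(lengths: List[int], q: float) -> float:
    """Approximate q-quantile of response lengths (simple index-based rule)."""
    ordered = sorted(lengths)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return float(ordered[idx])


@dataclass
class GateConfig:
    # All values are illustrative assumptions, not the paper's settings.
    min_isolation_steps: int = 50    # teacher stays frozen at least this long
    window: int = 20                 # steps averaged when estimating reward
    min_reward_gain: float = 0.01    # required gain over reward at last refresh
    max_tail_length: float = 4096.0  # cap on the high-quantile response length
    tail_quantile: float = 0.95


@dataclass
class ConsolidationGate:
    cfg: GateConfig
    rewards: List[float] = field(default_factory=list)
    steps_since_refresh: int = 0
    reward_at_refresh: float = float("-inf")

    def observe(self, step_reward: float) -> None:
        """Record one student training step; the teacher is untouched here."""
        self.rewards.append(step_reward)
        self.steps_since_refresh += 1

    def should_refresh(self, response_lengths: List[int]) -> bool:
        """Allow a refresh only if isolation, reward, and length checks all pass."""
        cfg = self.cfg
        if self.steps_since_refresh < cfg.min_isolation_steps:
            return False  # still inside the isolation period
        if len(self.rewards) < cfg.window or not response_lengths:
            return False  # not enough evidence yet
        recent = mean(self.rewards[-cfg.window:])
        improved = recent >= self.reward_at_refresh + cfg.min_reward_gain
        tail = length_quantile(response_lengths, cfg.tail_quantile)
        tail_safe = tail <= cfg.max_tail_length
        return improved and tail_safe

    def mark_refreshed(self) -> None:
        """Call after copying the student into the teacher."""
        self.reward_at_refresh = mean(self.rewards[-self.cfg.window:])
        self.steps_since_refresh = 0
```

In a training loop, one would call `observe` after each student update, call `should_refresh` with the lengths of recent responses, and copy the student weights into the teacher (followed by `mark_refreshed`) only when it returns `True`. A fixed clock-driven schedule corresponds to dropping the reward and length checks and keeping only the step counter, which is the configuration the abstract associates with state-oblivious collapse.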

Related papers