Fugu-MT Paper Translation (Abstract): On the Geometry of On-Policy Distillation

Paper abstract: On the Geometry of On-Policy Distillation

arxiv url: http://arxiv.org/abs/2606.07082v2
Date: Wed, 10 Jun 2026 09:18:33 GMT
Status: Translation complete
Last updated in system: 2026-06-11 14:23:44.339242
Title: On the Geometry of On-Policy Distillation
Title (reference translation): On the Geometry of On-Policy Distillation
Authors: Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung
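Abstract summary: We study on-policy distillation (OPD), which is used to improve large language model reasoning, and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). We find that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.
Reference score (independently computed attention score): 22.873898953554605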
License: http://creativecommons.org/licenses/by/4.0/
Abstract: On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.
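Abstract (reference translation): On-policy distillation (OPD) is increasingly used to improve the reasoning of large language models, yet its training dynamics are still poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, whereas compared with RLVR they are less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates quickly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance while substantially degrading SFT, which indicates that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Together, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.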
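
The abstract names three kinds of parameter-space measurements (how many weights an update touches, how strongly it avoids principal directions, and whether training can be confined to an early update subspace) without giving formulas. The following is a minimal NumPy sketch of what such measurements could look like for a single weight matrix. The function names, the tolerance `tol`, the rank `k`, and the choice of "top-k singular subspaces of the pretrained weight" as the principal directions are all assumptions made for illustration, not the authors' definitions.

```python
import numpy as np


def update_sparsity(w_before, w_after, tol=1e-8):
    """Fraction of entries whose absolute change is at most tol (hypothetical metric)."""
    delta = w_after - w_before
    return float(np.mean(np.abs(delta) <= tol))


def _project(basis_matrix, delta, k):
    """Project delta onto the top-k left and right singular subspaces of basis_matrix."""
    u, _, vt = np.linalg.svd(basis_matrix, full_matrices=False)
    uk = u[:, :k]          # top-k left singular vectors
    vk = vt[:k, :].T       # top-k right singular vectors
    return uk @ (uk.T @ delta @ vk) @ vk.T


def principal_energy(w_before, delta, k):
    """Share of the update's squared Frobenius norm inside the top-k singular
    subspaces of the starting weight. Values near 0 mean the update is
    'off-principal' under this (assumed) definition."""
    total = np.linalg.norm(delta) ** 2
    if total == 0.0:
        return 0.0
    return float(np.linalg.norm(_project(w_before, delta, k)) ** 2 / total)


def project_onto_early_subspace(early_delta, delta, k):
    """Restrict a later update to the top-k subspace spanned by an early
    cumulative update (one assumed way to 'constrain training' to it)."""
    return _project(early_delta, delta, k)
```

Under these assumptions, an update with high `update_sparsity` and low `principal_energy` would sit in the off-principal regime the abstract describes, and `project_onto_early_subspace` could be applied to each update after an early checkpoint to imitate the subspace-constrained training experiment.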

Related papers