Fugu-MT paper translation (summary): Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Paper summary: Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

arxiv url: http://arxiv.org/abs/2606.13657v2
Date: Fri, 12 Jun 2026 11:39:46 GMT
Status: translation complete
Last updated in system: 2026-06-15 13:53:03.783876
Title: Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
Title (reference translation): Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
Authors: Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang, Han-Jia Ye
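Abstract summary: Dense teacher supervision does not turn \textsc{OPD} into ordinary dense parameter rewriting. Instead, \textsc{OPD} retains important geometric signatures of on-policy post-training.
Reference score (attention score computed by this site): 39.39389868936592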
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: On-policy distillation (\textsc{OPD}) has recently become a prominent post-training recipe by combining two desirable ingredients: on-policy student trajectories and dense teacher supervision. However, how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and \textsc{OPD} use cases, our analysis yields two main findings. On sparsity, \textsc{OPD} updates are small and coordinate-sparse. They are distributed across layers, with the largest relative movement usually appearing in FFN modules. This sparse structure is operationally useful: training only the discovered subnetwork nearly recovers full-training performance. The sparse support does not remove the need for adaptive optimization: SGD, previously reported to be competitive in \textsc{RLVR}, underperforms AdamW in our \textsc{OPD} optimizer ablation, suggesting that dense teacher supervision preserves useful momentum structure and heterogeneous second-moment scales. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn \textsc{OPD} into ordinary dense parameter rewriting; instead, \textsc{OPD} retains important geometric signatures of on-policy post-training.
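Abstract (reference translation): On-policy distillation (\textsc{OPD}) has recently become a prominent post-training recipe that combines two desirable ingredients: on-policy student trajectories and dense teacher supervision. How this hybrid changes a model's parameters, however, is still unclear. An analysis across several language and vision-language model pairs and \textsc{OPD} use cases yields two main findings. On sparsity, \textsc{OPD} updates are small and coordinate-sparse. They are spread across layers, with the largest relative movement usually appearing in FFN modules. This sparse structure is operationally useful: training only the discovered subnetwork nearly recovers full-training performance. Sparse support does not remove the need for adaptive optimization: SGD, previously reported to be competitive in \textsc{RLVR}, underperforms AdamW in the \textsc{OPD} optimizer ablation, which suggests that dense teacher supervision preserves useful momentum structure and heterogeneous second-moment scales. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These results suggest that dense teacher supervision does not turn \textsc{OPD} into ordinary dense parameter rewriting; \textsc{OPD} instead retains important geometric signatures of on-policy post-training.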
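The sparsity and geometry quantities named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' measurement protocol, which the abstract does not specify: the function names, the tolerance `tol`, the 90% energy threshold, the choice of `k`, and the random toy matrices are all assumptions made for illustration. The toy update is random, so it only exercises the metrics and does not reproduce the paper's findings.

```python
import numpy as np

def update_sparsity(w_src, w_new, tol=1e-5):
    """Fraction of coordinates whose absolute change is at most tol (assumed threshold)."""
    return float(np.mean(np.abs(w_new - w_src) <= tol))

def spectral_stats(delta, energy=0.9, rank_tol=1e-10):
    """Numerical rank of delta, and how many singular directions hold `energy` of its squared mass."""
    s = np.linalg.svd(delta, compute_uv=False)
    numerical_rank = int(np.sum(s > rank_tol * s[0]))
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k_energy = int(np.searchsorted(cumulative, energy) + 1)
    return numerical_rank, k_energy

def principal_overlap(w_src, delta, k):
    """Share of ||delta||_F^2 lying in the top-k left and right singular subspaces of w_src."""
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    uk, vk = u[:, :k], vt[:k, :].T
    projected = uk @ (uk.T @ delta @ vk) @ vk.T
    return float(np.sum(projected**2) / np.sum(delta**2))

def small_weight_share(w_src, delta, q=0.1, tol=1e-5):
    """Share of changed coordinates whose source weight is in the smallest-q fraction by magnitude."""
    changed = np.abs(delta) > tol
    small = np.abs(w_src) <= np.quantile(np.abs(w_src), q)
    return float(np.mean(small[changed]))

# Toy data (hypothetical): a random "source" matrix and a small update on about 5% of coordinates.
rng = np.random.default_rng(0)
w_src = rng.standard_normal((64, 64))
mask = rng.random((64, 64)) < 0.05
signs = rng.choice([-1.0, 1.0], size=(64, 64))
w_new = w_src + 1e-3 * mask * signs
delta = w_new - w_src

sparsity = update_sparsity(w_src, w_new)
rank, k90 = spectral_stats(delta)
overlap = principal_overlap(w_src, delta, k=8)
share = small_weight_share(w_src, delta)

print(f"unchanged fraction: {sparsity:.3f}")
print(f"numerical rank: {rank}, directions holding 90% energy: {k90}")
print(f"overlap with top-8 source subspaces: {overlap:.3f}")
print(f"changed coordinates on small source weights: {share:.3f}")
```

In this reading, an update that is "numerically full-rank but spectrally concentrated" would show a numerical rank near the matrix dimension together with a much smaller energy count, and an update that lies "away from the principal singular subspaces" would show a low overlap value. Applying the same functions to per-layer weight matrices of a real source and post-trained checkpoint pair is left as an exercise under those assumptions.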

Related papers

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution [5.770893169582546]
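This work reframes attribution as subset-level counterfactual utility prediction. It introduces GRASP, an interaction-aware surrogate that decisively outperforms existing scalable baselines.
Paper reference translation (metadata) (2026-06-05T04:17:50Z)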
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation [51.210887267509854]
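The authors argue that the efficiency of OPD stems from a form of "foresight": early in training, it establishes a stable update trajectory toward the final model. They propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction.
Paper reference translation (metadata) (2026-05-12T08:19:15Z)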
LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization [44.37632368250295]
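The authors propose LEADER, a robust LiDAR-based relocalization framework enhanced by a simple yet effective geometric encoder. Experiments on the Oxford RobotCar and NCLT datasets show that LEADER outperforms state-of-the-art methods.
Paper reference translation (metadata) (2026-04-13T11:52:29Z)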
Exploring 3D Dataset Pruning [42.345465506597044]
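This work studies dataset pruning for 3D data. Approximating the full-data expected risk with a weighted subset exposes two key errors: a coverage error caused by insufficient representativeness, and a prior-mismatch bias caused by inconsistency between the class weights induced by the subset and the target metrics.
Paper reference translation (metadata) (2026-02-28T13:42:11Z)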
Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models [27.847140934456288]
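This paper proposes Selective Projection Decay (SPD), a new weight decay technique. SPD imposes a strong penalty on certain layers while allowing the other layers to change freely. When equipped with SPD, Adam provides better distributional robustness and out-of-distribution performance on benchmarks.
Paper reference translation (metadata) (2024-11-03T23:36:53Z)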
Class-Imbalanced Semi-Supervised Learning for Large-Scale Point Cloud Semantic Segmentation via Decoupling Optimization [64.36097398869774]
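Semi-supervised learning (SSL) is an active research topic for large-scale 3D scene understanding. Existing SSL-based methods suffer from severe training bias caused by class imbalance and the long-tailed distribution of point cloud data. This paper proposes a new decoupling optimization framework that separates feature representation learning from the classifier, optimizing each in a different way to effectively shift the biased decision boundary.
Paper reference translation (metadata) (2024-01-13T04:16:40Z)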
Gradient-based Weight Density Balancing for Robust Dynamic Sparse Training [59.48691524227352]
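Training a sparse neural network from scratch requires optimizing the connections at the same time as the weights themselves. While the connections within each layer are optimized multiple times during training, the density of each layer usually stays constant. The authors propose Global Gradient-based Redistribution, a technique that distributes weights across all layers.
Paper reference translation (metadata) (2022-10-25T13:32:09Z)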

The related papers list is generated automatically from the titles and abstracts of papers on this site.

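This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).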