Fugu-MT Paper Translation (Abstract): Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study

Paper summary: Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study

arxiv url: http://arxiv.org/abs/2604.17228v1
Date: Sun, 19 Apr 2026 03:20:40 GMT
Status: Translation complete
System update date: 2026-04-21 21:52:52.404336
Title: Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
Authors: Qingwei Lin,
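Abstract summary: The gate decision must propagate through many layers before it influences the language modeling (LM) loss. Auxiliary losses are commonly stacked to stabilise training, yet the interactions among them, particularly between a predictive auxiliary and explicit score supervision, have not been systematically compared under controlled conditions. The paper traces the net-negative effect of util/rank supervision to an off-policy oracle label that assumes all subsequent layers execute full, whereas gated execution routes only a fraction of tokens through full.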
Reference score (independently computed attention metric): 31.968379218484746
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Conditional depth execution routes a subset of tokens through a lightweight cheap FFN while the remainder execute the standard full FFN at each controlled layer. The central difficulty is gate training: the gate decision must propagate through many layers before it influences the language modeling (LM) loss, so the resulting gradients are weak and noisy. Auxiliary losses are commonly stacked to stabilise training, yet the interactions among them -- particularly between a predictive auxiliary and explicit score supervision -- have not been systematically compared under controlled conditions. We evaluate two gate designs under a 157.5M-parameter decoder-only model with controller-only training, 50% full-path budget, and 3-seed runs on a fineweb-edu subset. The MLP gate (G1) maps the current hidden state to a utility score; the JEPA-guided gate (G3) adds an action-conditional predictor that forecasts, in a low-dimensional latent space, the outcome of executing full vs. cheap per token, aligned against a fixed target head. Under the standard recipe with oracle-style utility regression and pairwise rank supervision (util/rank), G3 improves early-to-mid optimisation over G1 in 3/3 seeds (lower avg LM, faster threshold hits, ~10.3x lower grad norms), with 20k-step endpoint LM within a 0.005 heuristic reference. A key finding (ablation A3): jointly removing util/rank improves best/avg LM and threshold-hit speed in 3/3 seeds for both gates, and the early-to-mid advantage of G3 over G1 disappears. We trace this to an off-policy oracle label that assumes all subsequent layers execute full, whereas gated execution routes only a fraction through full -- making util/rank net-negative under the current recipe. Removing util/rank also cuts the training FLOPs proxy from ~1.53x to ~1.07x full-only (2.87h to 1.75h on a V100-32GB, ~39%). Conclusions are scoped to the studied regime.
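The routing step described in the abstract can be illustrated with a minimal sketch in plain Python. It assumes the 50% full-path budget is enforced by top-k selection over per-token gate scores; the abstract does not say how the budget is enforced, so this is an illustrative choice. The names `route_by_budget`, `full_ffn`, `cheap_ffn`, and `conditional_layer` are hypothetical, and the two FFN functions are toy stand-ins rather than the paper's modules.

```python
import math
from typing import List, Tuple

Vector = List[float]


def route_by_budget(scores: List[float], budget: float = 0.5) -> List[bool]:
    """Return a mask over tokens: True = full FFN, False = cheap FFN.

    Assumption: the budget is met by sending the k highest-scoring
    tokens through the full path, with k = floor(budget * n).
    """
    n = len(scores)
    k = int(budget * n)
    # Token indices sorted by descending gate score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    mask = [False] * n
    for i in order[:k]:
        mask[i] = True
    return mask


def full_ffn(x: Vector) -> Vector:
    """Toy stand-in for the standard (expensive) FFN."""
    return [math.tanh(2.0 * v) for v in x]


def cheap_ffn(x: Vector) -> Vector:
    """Toy stand-in for the lightweight FFN."""
    return [0.5 * v for v in x]


def conditional_layer(
    hidden: List[Vector], scores: List[float], budget: float = 0.5
) -> Tuple[List[Vector], List[bool]]:
    """Apply the full or cheap path per token according to the gate mask."""
    mask = route_by_budget(scores, budget)
    out = [full_ffn(t) if m else cheap_ffn(t) for t, m in zip(hidden, mask)]
    return out, mask


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    scores = [0.1, 0.9, -0.3, 0.4]  # hypothetical gate utility scores
    out, mask = conditional_layer(hidden, scores, budget=0.5)
    print(mask)
```

In this sketch the gate only decides which path each token takes at one layer. The abstract's point is that the effect of that decision on the LM loss is visible only after many further layers, which is why auxiliary supervision of the scores is commonly added.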

Related papers