Fugu-MT Paper Translation (Abstract): Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

Paper overview: Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

arxiv url: http://arxiv.org/abs/2604.09258v1
Date: Fri, 10 Apr 2026 12:17:18 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.851356
Title: Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
Authors: Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu
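Abstract summary: Pretraining is the cornerstone of Large Language Models (LLMs). It consumes the vast majority of the computational budget and data and serves as the primary engine for their capabilities.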
Reference score (independently computed attention score): 47.1662602024628
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pretraining is the cornerstone of Large Language Models (LLMs), consuming the vast majority of the computational budget and data and serving as the primary engine for their capabilities. During pretraining, LLMs acquire foundational knowledge from unprecedentedly massive and diverse data sources, encompassing a vast array of domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: Does the model converge to a common minimizer across all data sources, or merely to a minimizer of the summed loss? We hypothesize that the geometric "closeness" of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, with various data mixtures and hyperparameter schedules, show that Nexus significantly boosts downstream performance despite achieving the same pretraining loss. Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.
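The link between gradient similarity and the closeness of per-source minima can be illustrated with a toy sketch. This is not the Nexus update rule, which the abstract does not specify. It assumes two hypothetical data sources whose losses are simple quadratics, 0.5 * ||w - center||^2, and the names `quadratic_grad`, `cosine_similarity`, and `source_gradient_similarity` are invented for illustration only.

```python
import math


def quadratic_grad(w, center):
    """Gradient of the toy loss 0.5 * ||w - center||^2 with respect to w."""
    return [wi - ci for wi, ci in zip(w, center)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def source_gradient_similarity(w, center_a, center_b):
    """Cosine similarity between the gradients of two toy source losses at w."""
    grad_a = quadratic_grad(w, center_a)
    grad_b = quadratic_grad(w, center_b)
    return cosine_similarity(grad_a, grad_b)


# Same evaluation point, two hypothetical configurations of per-source minima.
w = [0.0, 2.0]
close = source_gradient_similarity(w, [0.1, 0.0], [-0.1, 0.0])
distant = source_gradient_similarity(w, [1.0, 0.0], [-1.0, 0.0])

# For equal-weight quadratics, the summed loss is minimized at the mean of
# the centers, here [0.0, 0.0]; the per-source gradients there point in
# opposite directions even though their sum is zero.
at_sum_min = source_gradient_similarity([0.0, 0.0], [1.0, 0.0], [-1.0, 0.0])

print(close, distant, at_sum_min)
```

In this toy setting, the per-source gradients are more aligned when the per-source minima are close, and they directly oppose each other at the minimizer of the summed loss when the minima are distant. Real pretraining losses are not quadratic, so this shows only the geometric intuition, not the paper's results.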


Related papers