Fugu-MT Paper Translation (Abstract): Pre-training under infinite compute

Paper summary: Pre-training under infinite compute

arxiv url: http://arxiv.org/abs/2509.14786v1
Date: Thu, 18 Sep 2025 09:36:23 GMT
Status: Translation complete
Last updated in system: 2025-09-19 17:26:53.149214
Title: Pre-training under infinite compute
Title (reference translation): Pre-training under infinite compute
Authors: Konwoo Kim, Suhas Kotha, Percy Liang, Tatsunori Hashimoto
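Abstract summary: This study shows that existing data-constrained approaches of increasing epoch count and parameter count eventually overfit. Ensembles of independently trained models achieve a significantly lower loss asymptote than the regularized recipe. The results suggest that more data-efficient pre-training is achievable in a compute-rich future.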
Reference score (independently computed attention score): 87.02472603429936
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that existing data-constrained approaches of increasing epoch count and parameter count eventually overfit, and we significantly improve upon such recipes by properly tuning regularization, finding that the optimal weight decay is $30\times$ larger than standard practice. Since our regularized recipe monotonically decreases loss following a simple power law in parameter count, we estimate its best possible performance via the asymptote of its scaling law rather than the performance at a fixed compute budget. We then identify that ensembling independently trained models achieves a significantly lower loss asymptote than the regularized recipe. Our best intervention combining epoching, regularization, parameter scaling, and ensemble scaling achieves an asymptote at 200M tokens using $5.17\times$ less data than our baseline, and our data scaling laws predict that this improvement persists at higher token budgets. We find that our data efficiency gains can be realized at much smaller parameter counts as we can distill an ensemble into a student model that is 8$\times$ smaller and retains $83\%$ of the ensembling benefit. Finally, our interventions designed for validation loss generalize to downstream benchmarks, achieving a $9\%$ improvement for pre-training evals and a $17.5\times$ data efficiency improvement over continued pre-training on math mid-training data. Our results show that simple algorithmic improvements can enable significantly more data-efficient pre-training in a compute-rich future.
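Abstract (reference translation): Because compute grows much faster than the web text available for language model pre-training, we ask how to approach pre-training under fixed data and without compute constraints. We first show that the existing data-constrained approaches of increasing epoch count and parameter count eventually overfit, and that properly tuning regularization improves on them, with the optimal weight decay being $30\times$ larger than standard practice. Because the regularized recipe decreases loss monotonically, following a simple power law in parameter count, we estimate its best possible performance from the asymptote of its scaling law rather than from its performance at a fixed compute budget. Next, we confirm that ensembling independently trained models yields a significantly lower loss asymptote than the regularized recipe. Our best intervention, combining epoching, regularization, parameter scaling, and ensemble scaling, achieves an asymptote at 200M tokens using $5.17\times$ less data than the baseline. These data efficiency gains can be realized at much smaller parameter counts, because the ensemble can be distilled into a student model that is 8$\times$ smaller and retains $83\%$ of the ensembling benefit. Finally, the interventions designed for validation loss generalize to downstream benchmarks, achieving a $9\%$ improvement on pre-training evals and a $17.5\times$ data efficiency improvement over continued pre-training on math mid-training data. The results suggest that more data-efficient pre-training is achievable in a compute-rich future.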
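
Illustrative note: the abstract's idea of judging a recipe by the asymptote of its scaling law, rather than by its loss at a fixed compute budget, can be sketched as follows. This is a minimal sketch that assumes a saturating power law of the form L(N) = E + A * N^(-alpha), where N is the parameter count and E is the asymptote. The functional form, the grid-search fit, the function name, and every number below are illustrative assumptions, not the authors' method or results.

```python
# Illustrative sketch only: the functional form, the fitting procedure, and
# all numbers are assumptions for demonstration, not values from the paper.

def fit_saturating_power_law(params, losses, alphas=None):
    """Fit L(N) = E + A * N**(-alpha) by grid search over alpha.

    For each candidate alpha the model is linear in (E, A), so ordinary
    least squares has a closed form. Returns (E, A, alpha) for the
    candidate with the smallest sum of squared errors.
    """
    if alphas is None:
        alphas = [i / 100 for i in range(5, 201)]  # 0.05, 0.06, ..., 2.00
    n = len(params)
    best = None
    for alpha in alphas:
        xs = [p ** (-alpha) for p in params]
        mean_x = sum(xs) / n
        mean_y = sum(losses) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
        A = sxy / sxx              # slope of loss against N**(-alpha)
        E = mean_y - A * mean_x    # intercept = loss as N goes to infinity
        sse = sum((E + A * x - y) ** 2 for x, y in zip(xs, losses))
        if best is None or sse < best[0]:
            best = (sse, E, A, alpha)
    _, E, A, alpha = best
    return E, A, alpha


# Synthetic losses generated from a known curve (hypothetical values).
TRUE_E, TRUE_A, TRUE_ALPHA = 3.0, 50.0, 0.5
param_counts = [1e8, 3e8, 6e8, 1.4e9]
observed = [TRUE_E + TRUE_A * n ** (-TRUE_ALPHA) for n in param_counts]

E_hat, A_hat, alpha_hat = fit_saturating_power_law(param_counts, observed)
print(f"estimated asymptote E = {E_hat:.4f}, alpha = {alpha_hat:.2f}")
```

Under this assumed form, two recipes would be compared by their fitted E values, that is, by the loss each would approach as parameter count grows without bound. In practice a nonlinear least-squares routine would likely replace the grid search; it is used here only to keep the sketch dependency-free.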


Related papers