Fugu-MT Paper Translation (Abstract): q0: Primitives for Hyper-Epoch Pretraining

Paper abstract: q0: Primitives for Hyper-Epoch Pretraining

arxiv url: http://arxiv.org/abs/2606.03938v2
Date: Wed, 03 Jun 2026 02:07:12 GMT
Status: Translation complete
Last updated in system: 2026-06-04 17:40:41.648085
Title: q0: Primitives for Hyper-Epoch Pretraining
Authors: Bishwas Mandal, Shmuel Berman, Akshay Vegesna, Samip Dahal,
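Abstract summary: Pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models. We show that q0 matches a strong 256-epoch ensemble baseline using only ~56 epochs (~4.6x fewer), or ~67 epochs (~3.8x fewer) when matched to the baseline's ensemble size.
Reference score (independently computed attention score): 0.5980755233352995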
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held-out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ~56 epochs (~4.6x fewer), or ~67 epochs (~3.8x fewer) when matched to the baseline's ensemble size, and continues to improve beyond it. These gains reach cumulative ~12.9x data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.
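The abstract names the primitives but gives no formulas, so the following is only a minimal sketch of two of them under stated assumptions, not the authors' implementation. It assumes a cosine-shaped cycle in which weight decay rises as the learning rate falls, and it assumes members are combined by a weighted average of their predicted probabilities. The function names `cyclic_schedule` and `ensemble_probs`, all parameter names, and the cosine shape are hypothetical; in q0 the member weights would come from the learned prior fit on a held-out set, whereas here they are simply passed in.

```python
import math


def cyclic_schedule(step, cycle_len, lr_min, lr_max, wd_min, wd_max):
    """Return (lr, wd) for a training step.

    Assumption: a cosine cycle. The learning rate starts each cycle at
    lr_max and reaches lr_min at mid-cycle; weight decay moves in the
    opposite direction, so the two are anti-correlated.
    """
    phase = (step % cycle_len) / cycle_len                 # position in cycle, in [0, 1)
    high = 0.5 * (1.0 + math.cos(2.0 * math.pi * phase))   # 1 at cycle start, 0 at mid-cycle
    lr = lr_min + (lr_max - lr_min) * high
    wd = wd_min + (wd_max - wd_min) * (1.0 - high)
    return lr, wd


def ensemble_probs(member_probs, weights):
    """Weighted average of per-member probability vectors.

    Assumption: aggregation happens in probability space and the
    weights are non-negative with a positive sum.
    """
    total = sum(weights)
    vocab = len(member_probs[0])
    return [
        sum(w * p[i] for w, p in zip(weights, member_probs)) / total
        for i in range(vocab)
    ]


if __name__ == "__main__":
    # Learning rate and weight decay at a few points of a 100-step cycle.
    for step in (0, 25, 50, 75):
        print(step, cyclic_schedule(step, 100, 1e-4, 1e-3, 0.01, 0.1))
    # Two hypothetical members over a two-token vocabulary.
    print(ensemble_probs([[0.5, 0.5], [1.0, 0.0]], [1.0, 3.0]))
```

Setting a member's weight to zero drops it from the ensemble, which is one simple way a single weight vector could express both selection and weighting for a given inference budget.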
