Fugu-MT Paper Translation (Abstract): Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

Paper abstract: Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

arxiv url: http://arxiv.org/abs/2606.06888v2
Date: Tue, 09 Jun 2026 06:01:51 GMT
Status: Translation complete
Last updated in system: 2026-06-10 13:21:50.702691
Title: Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws
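Title (reference translation): Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws
Authors: Zhiwei Xu, Shihao Wu, Hanseul Cho, Wei Hu, Yixin Wang
Abstract summary: We study data-constrained pretraining along two axes, regularization and scaling. We propose SoftQ, a scaling law that couples model size and data size to capture their interaction under repeated data.
Reference score (independently computed attention score): 41.26396818761427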
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Classical scaling laws for language model pretraining balance model size against training dataset size under a fixed compute budget, assuming abundant data and a single pass over the corpus. As training compute grows faster than the supply of natural language data, pretraining is likely to enter a data-constrained, compute-rich regime where models train for multiple epochs over a finite dataset. We study data-constrained pretraining along two axes, regularization and scaling. For regularization, we study masked-input regularization (MIR), an auxiliary next-token prediction loss on randomly masked inputs. MIR tests whether the random masking central to diffusion language models can benefit autoregressive pretraining without architectural changes or inference overhead. Across 72M to 1.4B parameter models, we find that MIR added on top of strong weight decay improves validation loss over autoregressive strong-weight-decay-only models, with downstream gains at 1.4B. For scaling, we propose SoftQ, a scaling law that couples model size and data size to capture their interaction under repeated data. Classical alternatives such as the Chinchilla law use an additive form that decouples these terms, making them misspecified in the data-constrained regime. We find that SoftQ fits data-constrained experiments substantially better than these alternatives, and estimates MIR's gains as equivalent to roughly 1.3 times as much unique training data. We release our code at https://github.com/yixinw-lab/dc_pretrain.
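For reference, the additive Chinchilla form that the abstract contrasts with SoftQ is commonly written as below, where N is the parameter count, D is the number of training tokens, and E, A, B, α, β are fitted constants. The symbol names follow the usual convention and are not taken from this abstract; the SoftQ functional form is not stated in the abstract and is not reproduced here.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Because the N and D terms are summed independently, this form has no term through which model size and data size can interact, which is the decoupling the abstract describes as misspecified under repeated data.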
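Abstract (reference translation): Classical scaling laws for language model pretraining balance model size against training dataset size under a fixed compute budget, assuming abundant data and a single pass over the corpus. As training compute grows faster than the supply of natural language data, pretraining is likely to enter a data-constrained, compute-rich regime in which models train for multiple epochs over a finite dataset. We study data-constrained pretraining along two axes, regularization and scaling. For regularization, we study masked-input regularization (MIR), an auxiliary next-token prediction loss on randomly masked inputs. MIR tests whether the random masking central to diffusion language models can benefit autoregressive pretraining without architectural changes or inference overhead. Across models from 72M to 1.4B parameters, we find that adding MIR on top of strong weight decay improves validation loss over autoregressive models trained with strong weight decay alone, with downstream gains at 1.4B. For scaling, we propose SoftQ, a scaling law that couples model size and data size to capture their interaction under repeated data. Classical alternatives such as the Chinchilla law use an additive form that decouples these terms, which makes them misspecified in the data-constrained regime. We find that SoftQ fits data-constrained experiments substantially better than these alternatives, and that it estimates MIR's gains as equivalent to roughly 1.3 times as much unique training data. We release our code at https://github.com/yixinw-lab/dc_pretrain.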
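The MIR idea described in the abstract (an auxiliary next-token prediction loss on randomly masked inputs, added to the usual autoregressive loss) can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy causal model, the names `toy_logits`, `next_token_loss`, and `mir_loss`, the parameters `mask_id`, `mask_rate`, and `aux_weight`, and the choice to keep the unmasked next tokens as targets. The authors' actual implementation is in their released repository.

```python
import numpy as np

def toy_logits(tokens, emb, out):
    # Hypothetical causal toy model: position t sees the running mean
    # of the embeddings at positions 0..t, then projects to the vocabulary.
    steps = np.arange(1, len(tokens) + 1)[:, None]
    hidden = np.cumsum(emb[tokens], axis=0) / steps
    return hidden @ out

def next_token_loss(logits, targets):
    # Mean cross-entropy of logits[t] against targets[t].
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def mir_loss(tokens, emb, out, mask_id, mask_rate, aux_weight, rng):
    # Standard autoregressive loss on the clean inputs.
    inputs, targets = tokens[:-1], tokens[1:]
    clean = next_token_loss(toy_logits(inputs, emb, out), targets)
    # Auxiliary loss: same targets, but each input token is replaced by
    # mask_id with probability mask_rate (assumed masking scheme).
    masked = np.where(rng.random(len(inputs)) < mask_rate, mask_id, inputs)
    aux = next_token_loss(toy_logits(masked, emb, out), targets)
    return clean + aux_weight * aux, clean, aux

rng = np.random.default_rng(0)
vocab, dim = 10, 4
emb = rng.normal(size=(vocab + 1, dim))  # extra row for the mask token
out = rng.normal(size=(dim, vocab))
tokens = rng.integers(0, vocab, size=12)
total, clean, aux = mir_loss(tokens, emb, out, mask_id=vocab,
                             mask_rate=0.3, aux_weight=0.5, rng=rng)
print(total, clean, aux)
```

In this sketch the masking only affects the training loss, so the model and its inference path are unchanged, which matches the abstract's claim of no architectural changes or inference overhead.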


Related papers