Fugu-MT Paper Translation (Abstract): Scaling Laws for Mixture Pretraining Under Data Constraints

Paper summary: Scaling Laws for Mixture Pretraining Under Data Constraints

arxiv url: http://arxiv.org/abs/2605.12715v2
Date: Fri, 15 May 2026 17:01:07 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:25.960655
Title: Scaling Laws for Mixture Pretraining Under Data Constraints
Authors: Anastasiia Sedova, Skyler Seto, Natalie Schluter, Pierre Ablin,
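Abstract summary: A common strategy is to mix scarce but valuable target data with abundant generic data. The authors study this trade-off across more than 2,000 language-model training runs and find that repetition is a central driver of target-domain performance.
Reference score (independently computed attention score): 20.29616657791023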
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale. Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints.
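The repetition trade-off described in the abstract can be made concrete with simple token accounting. The sketch below is not the paper's scaling law. It only assumes that the target share of the mixture is drawn uniformly from the target corpus, so the number of passes over that corpus is (total training tokens × target fraction) / target corpus size. The function names and the example figures are hypothetical and chosen purely for illustration.

```python
def target_repetitions(total_tokens: float, target_fraction: float,
                       target_size: float) -> float:
    """Passes over the target corpus implied by a mixture (illustrative only).

    total_tokens:    training budget in tokens (assumed known)
    target_fraction: share of the mixture drawn from the target corpus, in [0, 1]
    target_size:     number of unique target tokens available
    """
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must lie in [0, 1]")
    if total_tokens < 0 or target_size <= 0:
        raise ValueError("total_tokens must be >= 0 and target_size > 0")
    return total_tokens * target_fraction / target_size


def fraction_for_repetitions(total_tokens: float, repetitions: float,
                             target_size: float) -> float:
    """Mixture fraction that yields a desired number of target passes.

    The result is capped at 1.0, because the target data cannot make up
    more than the whole mixture.
    """
    if total_tokens <= 0 or target_size <= 0 or repetitions < 0:
        raise ValueError("inputs must be positive (repetitions may be 0)")
    return min(repetitions * target_size / total_tokens, 1.0)


if __name__ == "__main__":
    # Hypothetical budget: 100B training tokens, 0.5B unique target tokens.
    budget, target = 100e9, 0.5e9
    print(target_repetitions(budget, 0.10, target))
    print(fraction_for_repetitions(budget, 15, target))
```

With these hypothetical numbers, a 10% target share corresponds to roughly 20 passes over the target corpus, which is the upper end of the 15-20 range reported in the abstract. The accounting says nothing about which repetition count is optimal; according to the abstract, that depends on target data size, compute budget, and model scale.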

