Fugu-MT Paper Translation (Abstract): Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls

Paper summary: Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls

arxiv url: http://arxiv.org/abs/2510.01631v1
Date: Thu, 02 Oct 2025 03:24:42 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.966864
Title: Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
Title (reference translation): Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
Authors: Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad, Mostafa Elhoushi, Shubhabrata Sengupta, Shang-Wen Li, Ramya Raghavendra, Ruoxi Jia, Carole-Jean Wu
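Abstract summary: Training data plays a crucial role in scaling large language models (LLMs), yet high-quality data is in limited supply. The study compares natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Pre-training on rephrased synthetic data alone is not faster than pre-training on natural web text.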
Reference score (independently computed attention score): 25.294408301653576
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training data plays a crucial role in Large Language Models (LLM) scaling, yet high quality data is of limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale empirical investigation (>1000 LLMs with >100k GPU hours) using a unified protocol and scaling laws, comparing natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Specifically, we found pre-training on rephrased synthetic data alone is not faster than pre-training on natural web texts; while pre-training on 1/3 rephrased synthetic data mixed with 2/3 natural web texts can speed up 5-10x (to reach the same validation loss) at larger data budgets. Pre-training on textbook-style synthetic data alone results in notably higher loss on many downstream domains especially at small data budgets. "Good" ratios of synthetic data in training data mixtures depend on the model size and data budget, empirically converging to ~30% for rephrased synthetic data. Larger generator models do not necessarily yield better pre-training data than ~8B-param models. These results contribute mixed evidence on "model collapse" during large-scale single-round (n=1) model training on synthetic data: training on rephrased synthetic data shows no degradation in performance in foreseeable scales whereas training on mixtures of textbook-style pure-generated synthetic data shows patterns predicted by "model collapse". Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance.
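Abstract (reference translation): Training data plays a crucial role in scaling large language models (LLMs), yet high-quality data is in limited supply. Synthetic data techniques offer a potential path around these limitations. Using a unified protocol and scaling laws, we conduct a large-scale empirical investigation (>1000 LLMs, >100k GPU hours) comparing natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Specifically, pre-training on rephrased synthetic data alone is not faster than pre-training on natural web text, whereas pre-training on 1/3 rephrased synthetic data mixed with 2/3 natural web text can be 5-10x faster (to reach the same validation loss) at larger data budgets. Pre-training on textbook-style synthetic data alone results in notably higher loss on many downstream domains, especially at small data budgets. "Good" ratios of synthetic data in a training mixture depend on model size and data budget, empirically converging to about 30% for rephrased synthetic data. Larger generator models do not necessarily yield better pre-training data than ~8B-parameter models. These results provide mixed evidence on "model collapse" in large-scale single-round (n=1) training on synthetic data: training on rephrased synthetic data shows no performance degradation at foreseeable scales, whereas training on mixtures of textbook-style, purely generated synthetic data shows patterns predicted by "model collapse". This work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance.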
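
Note on the speed-up metric: the abstract reports speed-up as the reduction in training data needed "to reach the same validation loss". The sketch below shows one way such a ratio could be computed from two fitted loss curves. It assumes a saturating power law L(D) = E + B / D^beta, which is a common scaling-law form but is not confirmed here as the form the authors fit. The function names and every numeric value are hypothetical and chosen only for illustration; none of them come from the paper.

```python
# Illustrative only: the functional form and all numbers below are
# hypothetical assumptions, not values or methods taken from the paper.


def tokens_to_reach(target_loss, irreducible, coeff, exponent):
    """Invert L(D) = irreducible + coeff / D**exponent to solve for D."""
    if target_loss <= irreducible:
        raise ValueError("target loss is not reachable under this curve")
    return (coeff / (target_loss - irreducible)) ** (1.0 / exponent)


def speedup(target_loss, baseline, candidate):
    """Ratio of data needed by the baseline vs. the candidate mixture."""
    return (tokens_to_reach(target_loss, **baseline)
            / tokens_to_reach(target_loss, **candidate))


# Hypothetical fitted curves for two training-data mixtures.
natural_only = {"irreducible": 2.0, "coeff": 100.0, "exponent": 0.5}
mixed = {"irreducible": 2.0, "coeff": 50.0, "exponent": 0.5}

ratio = speedup(2.5, natural_only, mixed)
print(f"speed-up at loss 2.5: {ratio:.1f}x")
```

Under this assumed form, two curves with the same exponent and floor give a speed-up that is constant across target losses. A speed-up that grows at larger data budgets, as the abstract describes, would require the fitted curves to differ in exponent or floor.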

Related papers