Fugu-MT Paper Translation (Abstract): Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

Paper abstract: Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

arxiv url: http://arxiv.org/abs/2605.17849v1
Date: Mon, 18 May 2026 04:44:40 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:48.811196
Title: Generating Pretraining Tokens from Organic Data for Data-Bound Scaling
Title (reference translation): Generation of Pretraining Tokens from Organic Data for Data-Bound Scaling
Authors: Zichun Yu, Chenyan Xiong
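Abstract summary: SynPro is a synthetic data generation framework that helps LLMs learn more thoroughly from limited organic data. The authors pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline. The results show that organic data is significantly underutilized by standard repetition.
Reference score (attention score computed independently by this site): 28.30636190022749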
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformat, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at https://github.com/cxcscmu/SynPro.
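Abstract (reference translation): LLM pretraining is shifting from a compute-bound to a data-bound regime. However, reaching the data-bound regime does not mean that the model has fully utilized its organic corpus. This paper introduces SynPro, a synthetic data generation framework that helps LLMs learn more thoroughly from limited organic data. SynPro applies two operations, rephrasing and reformatting, which present the same organic source in diverse forms to promote deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus so that they target content the model has not yet absorbed. The authors pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. The source code is available at https://github.com/cxcscmu/SynPro.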
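The token budgets quoted in the abstract (0.8B and 2.2B) can be reproduced with simple arithmetic. The sketch below assumes the common rule of thumb of roughly 20 training tokens per parameter for a Chinchilla-optimal run. That ratio is not stated in the abstract; it is an assumption chosen because it matches the quoted numbers. The function name `data_bound_budget` is hypothetical and does not come from the SynPro code.

```python
# Assumption (not stated in the abstract): Chinchilla-optimal training uses
# roughly 20 tokens per model parameter.
TOKENS_PER_PARAM = 20
# From the abstract: only 10% of the Chinchilla-optimal tokens are available.
DATA_FRACTION = 0.10


def data_bound_budget(n_params: float,
                      tokens_per_param: float = TOKENS_PER_PARAM,
                      fraction: float = DATA_FRACTION) -> float:
    """Return the organic-token budget for a model with n_params parameters."""
    chinchilla_optimal_tokens = n_params * tokens_per_param
    return chinchilla_optimal_tokens * fraction


for name, n_params in [("400M", 400e6), ("1.1B", 1.1e9)]:
    budget = data_bound_budget(n_params)
    print(f"{name}: {budget / 1e9:.1f}B organic tokens")
# prints:
# 400M: 0.8B organic tokens
# 1.1B: 2.2B organic tokens
```

Under this assumption, the 400M model would have a Chinchilla-optimal budget of about 8B tokens and the 1.1B model about 22B, so the 10% setting leaves 0.8B and 2.2B unique organic tokens respectively.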

Related papers