Fugu-MT Paper Translation (Abstract): Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining

Paper overview: Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining

arxiv url: http://arxiv.org/abs/2606.16246v2
Date: Fri, 19 Jun 2026 17:02:20 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:48.147549
Title: Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining
Title (reference translation): Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining
Authors: Michael K. Chen, Xikun Zhang, Fan Bai, Zhengding Hu, Zhen Wang,
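Abstract summary: Language model pretraining is shifting toward a data-constrained, compute-abundant regime. The paper introduces three categories of augmentation for AR pretraining: token-level noise, sequence permutations, and target offset prediction. Individual augmentations are found to delay overfitting and to lower validation loss relative to the baseline.
Reference score (attention level computed independently by this site): 6.5664347332837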
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate training-time data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining.
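Abstract (reference translation): As AI labs approach a data ceiling at which compute capacity outpaces the rate at which new high-quality text is produced, language model pretraining is shifting toward a data-constrained, compute-abundant regime that requires productive multi-epoch training on a fixed corpus. Standard autoregressive (AR) pretraining overfits severely in this setting: it reaches its optimum early and then degrades continuously. We study training-time data augmentation as a regularizer that mitigates this overfitting and enables productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Systematic ablations show that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among the individual methods. Combining augmentation categories lowers the minimum validation loss further. Our experiments show that data augmentation mitigates the data inefficiency of AR pretraining and offers a promising solution for the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining.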
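
The three augmentation categories named in the abstract can be illustrated on a plain list of token ids. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names, the sentinel ids `pre_id`, `suf_id`, `mid_id`, the mask id, and the prefix-suffix-middle ordering used for Fill-in-the-Middle are all hypothetical choices made for illustration. The abstract does not say how noised inputs are paired with targets or which rates are used, so those details are left open here.

```python
import random


def random_replace(tokens, vocab_size, p, rng):
    """Token-level noise: swap each token for a uniformly sampled id with probability p."""
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]


def mask_tokens(tokens, mask_id, p, rng):
    """Token-level noise: swap each token for a (hypothetical) mask id with probability p."""
    return [mask_id if rng.random() < p else t for t in tokens]


def right_to_left(tokens):
    """Sequence permutation: reverse the sequence so the model predicts right to left."""
    return tokens[::-1]


def fill_in_the_middle(tokens, rng, pre_id, suf_id, mid_id):
    """Sequence permutation: cut into prefix/middle/suffix and move the middle to the end.

    Assumed layout: [pre_id] prefix [suf_id] suffix [mid_id] middle.
    Requires at least one token so that two distinct cut points exist.
    """
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre_id] + prefix + [suf_id] + suffix + [mid_id] + middle


def offset_pairs(tokens, i):
    """Target offset prediction: the input at position t is paired with the token at t + i.

    i = 1 is ordinary next-token prediction; the abstract's augmentation uses i > 1.
    """
    if i < 1 or i >= len(tokens):
        raise ValueError("offset must satisfy 1 <= i < len(tokens)")
    return tokens[:-i], tokens[i:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    seq = [10, 11, 12, 13, 14, 15]
    print(random_replace(seq, vocab_size=100, p=0.3, rng=rng))
    print(fill_in_the_middle(seq, rng, pre_id=100, suf_id=101, mid_id=102))
    print(offset_pairs(seq, 2))
```

Because each function takes and returns a token list, the transformations can be applied to a freshly sampled copy of each sequence at every epoch, which is one plausible reading of "training-time" augmentation on a fixed corpus.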

Related papers