Fugu-MT Paper Translation (Abstract): Data-efficient pre-training by scaling synthetic megadocs

Paper abstract: Data-efficient pre-training by scaling synthetic megadocs

arxiv url: http://arxiv.org/abs/2603.18534v1
Date: Thu, 19 Mar 2026 06:30:33 GMT
Status: Translation complete
Last updated in system: 2026-03-20 17:19:05.983891
Title: Data-efficient pre-training by scaling synthetic megadocs
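Title (reference translation): Improving data efficiency by scaling synthetic megadocs
Authors: Konwoo Kim, Suhas Kotha, Yejin Choi, Tatsunori Hashimoto, Nick Haber, Percy Liang
Abstract summary: We describe how to design synthetic data algorithms that improve loss scaling. We first show that pre-training on web data mixed with synthetic rephrases improves validation loss. Synthetic generations from the same document can form a single, much longer megadocument.
Reference score (independently computed attention score): 108.28995799706763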
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48\times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from $1.48\times$ to $1.80\times$ at $32$ generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.
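Abstract (reference translation): When pre-training is limited by data rather than by compute, synthetic data augmentation has emerged as a promising solution. We study how to design synthetic data algorithms that achieve better loss scaling, lowering loss not only at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, even though the synthetic data comes from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48\times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single, substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching together synthetic rephrases of the same web document, or stretching a document by inserting rationales. Relative to simple rephrasing, both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss, raising data efficiency from $1.48\times$ to $1.80\times$ at $32$ generations per document. Importantly, the advantage of megadocs over simple rephrasing widens as more synthetic data is generated. These results show how to design synthetic data algorithms that benefit more from increasing compute in the data-constrained setting.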
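The "stitching" construction mentioned in the abstract can be illustrated with a minimal sketch. The abstract only says that rephrases of the same web document are combined into one longer megadocument. The function name `stitch_megadocs`, the `(doc_id, text)` input format, the blank-line separator, and the choice to keep rephrases in input order are all assumptions made here for illustration, not details taken from the paper.

```python
from collections import defaultdict
from typing import Iterable


def stitch_megadocs(
    rephrases: Iterable[tuple[str, str]],
    separator: str = "\n\n",  # assumed separator; the abstract does not specify one
) -> dict[str, str]:
    """Group (doc_id, text) pairs by doc_id and join each group into one megadoc."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc_id, text in rephrases:
        grouped[doc_id].append(text)
    # One long document per source document, rephrases kept in input order.
    return {doc_id: separator.join(texts) for doc_id, texts in grouped.items()}


# Hypothetical toy input: two source documents with interleaved rephrases.
samples = [
    ("doc-a", "Rephrase 1 of A."),
    ("doc-b", "Rephrase 1 of B."),
    ("doc-a", "Rephrase 2 of A."),
]
megadocs = stitch_megadocs(samples)
```

In this sketch, simple rephrasing would treat the three samples as three short training documents, whereas stitching yields two longer ones, one per source document.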


Related papers