Fugu-MT Paper Translation (Summary): WRAP++: Web discoveRy Amplified Pretraining

Paper summary: WRAP++: Web discoveRy Amplified Pretraining

arxiv url: http://arxiv.org/abs/2604.06829v1
Date: Wed, 08 Apr 2026 08:47:31 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.433755
Title: WRAP++: Web discoveRy Amplified Pretraining
Title (reference translation): WRAP++: Web DiscoveRy Amplified Pretraining
Authors: Jiang Zhou, Yunhao Wang, Xing Wu, Tinghao Yu, Feng Zhang,
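Abstract summary: We propose WRAP++ (Web discoveRy Amplified Pretraining). WRAP++ discovers cross-document relationships from web hyperlinks and synthesizes joint QA over each document pair. On SimpleQA, OLMo-based models at the 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches.
Reference score (independently computed attention score): 9.79503335028396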
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.
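Abstract (reference translation): Synthetic data rephrasing has emerged as a powerful technique for promoting knowledge acquisition in large language model (LLM) pretraining. However, existing approaches operate at the single-document level and rewrite individual web pages in isolation. This restricts synthesized examples to intra-document knowledge, misses cross-document relationships, and leaves facts with limited associative context. WRAP++ (Web discoveRy Amplified Pretraining) amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each document pair. Specifically, WRAP++ discovers high-confidence relational motifs, including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This yields relational knowledge that is absent from either source document alone and creates diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis amplifies data scale well beyond single-document rewriting. Instantiating WRAP++ on Wikipedia amplifies about 8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at the 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and show sustained scaling gains, underscoring the benefit of cross-document knowledge discovery and amplification.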
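The abstract names two relational motifs, dual-links and co-mentions, without defining them. The following is a minimal sketch under stated assumptions, not the authors' implementation: a dual-link is taken to mean two pages that hyperlink to each other, and a co-mention is taken to mean two pages that are both linked from at least `min_support` other pages. The function names, the `min_support` parameter, and the toy link graph are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def find_dual_links(links):
    """Return unordered page pairs that link to each other.

    `links` maps a page title to the set of titles it hyperlinks to.
    Assumed definition: the abstract does not specify what a dual-link is.
    """
    pairs = set()
    for src, targets in links.items():
        for dst in targets:
            # Keep the pair only if the reverse link also exists.
            if src != dst and src in links.get(dst, ()):
                pairs.add(tuple(sorted((src, dst))))
    return pairs


def find_co_mentions(links, min_support=2):
    """Return page pairs linked together from >= min_support distinct pages.

    Assumed definition: the abstract does not specify what a co-mention is,
    and `min_support` is a hypothetical confidence threshold.
    """
    support = defaultdict(int)
    for src, targets in links.items():
        others = sorted(set(targets) - {src})
        # Every pair of targets on the same page is co-mentioned once.
        for a, b in combinations(others, 2):
            support[(a, b)] += 1
    return {pair for pair, count in support.items() if count >= min_support}


# Hypothetical toy link graph (not real Wikipedia data).
toy_links = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"D"},
    "hub1": {"A", "D"},
    "hub2": {"A", "D", "C"},
}

dual = find_dual_links(toy_links)
co = find_co_mentions(toy_links, min_support=2)
print("dual-links:", dual)
print("co-mentions:", co)
```

Each pair returned by either function would then be a candidate for joint QA synthesis over the two documents. Because co-mention candidates are enumerated over every pair of targets on a page, their count grows quadratically with the number of links per page, which is in line with the abstract's remark that valid pairs grow combinatorially.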

Related papers