Fugu-MT Paper Translation (Summary): Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG

Paper overview: Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG

arxiv url: http://arxiv.org/abs/2603.23562v2
Date: Mon, 30 Mar 2026 08:42:02 GMT
Status: Translation complete
Last updated in system: 2026-03-31 13:48:18.798585
Title: Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
Authors: Seungju Han, Konwoo Kim, Chanwoo Park, Benjamin Newman, Suhas Kotha, Jaehun Jung, James Zou, Yejin Choi
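Abstract summary: This paper introduces Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This enables log-linear improvements as synthetic data volume and generator strength increase. Across models and benchmarks, the training lets models beat RAG in five of six settings, outperforming it by 2.6%, and yields a 9.1% gain when combined with RAG.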
Reference score (independently computed attention score): 56.95387658211215
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods, by training on more synthetic tokens or using stronger generators, yields diminishing returns that plateau below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This leverages their complementary training signals and enables log-linear improvements as both synthetic data volume and generator strength increase. This allows the model to outperform RAG by a relative 2.6% on QuaLITY, a long-document reading comprehension benchmark. In addition, we introduce Focal Rewriting, a simple technique for synthetic document generation that explicitly conditions document generation on specific questions, improving the diversity of synthetic documents and yielding a steeper log-linear scaling curve. On QuaLITY, our final recipe trains a Llama 8B model that outperforms RAG by a relative 4.4%. Across models and benchmarks (QuaLITY, LongHealth, FinanceBench), our training enables models to beat RAG in five of six settings, outperforming it by 2.6%, and achieves a 9.1% gain when combined with RAG.
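As a reading aid, the two quantities the abstract leans on, "relative gain" and "log-linear" scaling, can be sketched as below. This is a minimal illustration and not the authors' code: the function names `relative_gain` and `fit_log_linear` are hypothetical, the form accuracy ≈ a + b·ln(tokens) is an assumed reading of "log-linear", and every number in the example is made up for illustration rather than taken from the paper.

```python
import math


def relative_gain(method_acc: float, baseline_acc: float) -> float:
    """Gain of a method over a baseline, expressed as a fraction of the baseline."""
    return (method_acc - baseline_acc) / baseline_acc


def fit_log_linear(tokens, accuracies):
    """Ordinary least-squares fit of accuracy ~ a + b * ln(tokens); returns (a, b)."""
    xs = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data points, NOT results from the paper:
# accuracy rises by a fixed 0.04 for every 10x increase in synthetic tokens.
tokens = [1e6, 1e7, 1e8, 1e9]
accuracies = [0.50, 0.54, 0.58, 0.62]
a, b = fit_log_linear(tokens, accuracies)

# Under a log-linear trend, each 10x increase in tokens adds b * ln(10) accuracy.
gain_per_decade = b * math.log(10)

# Hypothetical accuracies: a method at 0.513 against a baseline at 0.500.
example_gain = relative_gain(0.513, 0.500)
```

Under these assumptions, a constant `gain_per_decade` is what a straight line on a log-scaled token axis looks like, and a relative gain is measured against the baseline's own score rather than in absolute percentage points.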
