Fugu-MT Paper Translation (Summary): To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

Paper summary: To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

arxiv url: http://arxiv.org/abs/2604.00715v1
Date: Wed, 01 Apr 2026 10:26:03 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.937854
Title: To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining
Title (reference translation): Scaling Laws for RAG-Considerate Pretraining
Authors: Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, Steven Y. Feng
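Abstract summary: This work studies the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. It finds that retrieval consistently improves performance over parametric-only baselines across model scales.
Reference score (independently computed attention score): 24.61808957290675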
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.
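Abstract (reference translation): Retrieval-augmented generation (RAG) improves language model (LM) performance by supplying relevant context at test time in knowledge-intensive settings. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed through retrieval is not well understood, particularly under a fixed data budget. This work systematically examines the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. The authors train OLMo-2-based LMs with 30M to 3B parameters on up to 100B tokens of DCLM data, varying both the pretraining data scale (1-150x the number of parameters) and the retrieval store size (1-20x), and evaluate them on a diverse benchmark suite spanning reasoning, scientific QA, and open-domain QA. Retrieval consistently improves performance over parametric-only baselines across model scales. The authors also introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold makes it possible to estimate the optimal allocation of a fixed data budget between pretraining and retrieval, and it shows that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. The results give a quantitative basis for understanding when and how retrieval should complement pretraining, and offer practical guidance for allocating data resources when designing scalable language modeling systems.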
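
The budget-allocation idea in the abstract can be illustrated with a small sketch. The abstract does not state the functional form of the scaling manifold, so the additive power law below, every constant in it, and the helper names `loss` and `best_split` are assumptions made purely for illustration; they are not the authors' fitted model or results.

```python
# Illustrative sketch only. The functional form and all constants are
# hypothetical placeholders, NOT taken from the paper.

def loss(n_params, pretrain_tokens, retrieval_tokens,
         E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28, C=5.0, gamma=0.10):
    """Assumed additive power law in model size, pretraining tokens,
    and retrieval store size (lower is better)."""
    return (E
            + A / n_params ** alpha
            + B / pretrain_tokens ** beta
            + C / retrieval_tokens ** gamma)


def best_split(n_params, budget_tokens, steps=999):
    """Grid-search the fraction of a fixed token budget given to pretraining;
    the remainder goes to the retrieval store."""
    best_frac, best_loss = None, float("inf")
    for i in range(1, steps + 1):
        frac = i / (steps + 1)            # strictly between 0 and 1
        d = frac * budget_tokens          # pretraining tokens
        r = budget_tokens - d             # retrieval store tokens
        value = loss(n_params, d, r)
        if value < best_loss:
            best_frac, best_loss = frac, value
    return best_frac, best_loss


if __name__ == "__main__":
    frac, value = best_split(n_params=1e9, budget_tokens=1e11)
    print(f"pretraining share: {frac:.3f}, predicted loss: {value:.4f}")
```

With a form fitted to real measurements, the same search could be repeated per model size and per task to see how the preferred split shifts, which is the kind of dependence the abstract describes.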

Related papers