Fugu-MT Paper Translation (Abstract): The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data

Paper summary: The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data

arxiv url: http://arxiv.org/abs/2603.16177v1
Date: Tue, 17 Mar 2026 06:55:53 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.136772
Title: The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data
Title (reference translation): The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data
Authors: Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek, Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J. Zico Kolter, Bogdan Gaza, Ari Morcos, Matthew Leavitt, Pratyush Maini,
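Abstract summary: We study a simple strategy, specialized pretraining (SPT), in which a small domain dataset is repeated starting from pretraining as a fraction of the total tokens. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. Finetuning may appear to be the cheapest path to domain adaptation, but introducing specialized domain data during pretraining stretches its utility.
Reference score (independently computed attention score): 55.87500250831868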
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. These gains grow when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B SPT model outperforms a 3B standard pretrained model. Beyond these empirical gains, we derive overfitting scaling laws to guide practitioners in selecting the optimal domain-data repetition for a given pretraining compute budget. Our observations reveal the finetuner's fallacy: while finetuning may appear to be the cheapest path to domain adaptation, introducing specialized domain data during pretraining stretches its utility. SPT yields better specialized domain performance (via reduced overfitting across repeated exposures) and better general domain performance (via reduced forgetting during finetuning), ultimately achieving stronger results with fewer parameters and less total compute when amortized over inference. To get the most out of domain data, incorporate it as early in training as possible.
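Abstract (reference translation): Real-world model deployments demand strong performance on narrow domains where data is scarce. Practitioners typically finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), in which a small domain dataset normally reserved for finetuning is repeated from the start of pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared with standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. On domains far from web text, a 1B SPT model outperforms a 3B standard pretrained model. Beyond these empirical gains, we derive overfitting scaling laws to help practitioners select the optimal domain-data repetition for a given pretraining compute budget. Finetuning may appear to be the cheapest path to domain adaptation, but introducing specialized domain data during pretraining stretches its utility. SPT yields better specialized domain performance (by reducing overfitting across repeated exposures) and better general domain performance (by reducing forgetting during finetuning). To get the most out of domain data, incorporate it as early in training as possible.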
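The abstract describes SPT as repeating a small domain dataset "as a fraction of the total tokens" and reports a token reduction of "up to 1.75x". The bookkeeping behind those two statements can be sketched as follows. This is a minimal illustration, not the paper's method: the function names, the 1% mixing fraction, and the token counts are all hypothetical, and the paper's overfitting scaling laws (which would guide the choice of fraction) are not reproduced here.

```python
def domain_repetitions(total_tokens: float,
                       domain_fraction: float,
                       domain_tokens: float) -> float:
    """Passes over the domain dataset implied by a mixing fraction.

    Hypothetical helper: if `domain_fraction` of all pretraining tokens
    come from a domain dataset of size `domain_tokens`, the dataset is
    seen (domain_fraction * total_tokens / domain_tokens) times.
    """
    if not 0.0 <= domain_fraction <= 1.0:
        raise ValueError("domain_fraction must lie in [0, 1]")
    if total_tokens <= 0 or domain_tokens <= 0:
        raise ValueError("token counts must be positive")
    return domain_fraction * total_tokens / domain_tokens


def tokens_after_reduction(baseline_tokens: float,
                           reduction_factor: float = 1.75) -> float:
    """Tokens needed if a baseline budget shrinks by `reduction_factor`.

    The default 1.75 is the best case ("up to") quoted in the abstract,
    not a guaranteed saving.
    """
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return baseline_tokens / reduction_factor


if __name__ == "__main__":
    # Assumed, illustrative numbers: 100B total tokens, 1% domain mix,
    # 100M-token domain dataset.
    reps = domain_repetitions(100e9, 0.01, 100e6)
    print(f"domain dataset repeated about {reps:.1f} times")

    # Assumed baseline of 175B tokens under standard pretraining.
    needed = tokens_after_reduction(175e9)
    print(f"best-case SPT budget: {needed / 1e9:.0f}B tokens")
```

Under these assumed numbers, a 1% mix corresponds to roughly ten passes over the domain data, which is the kind of repetition count the abstract's scaling laws are meant to help choose.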

Related papers