Fugu-MT paper translation (abstract): Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Paper overview: Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

arxiv url: http://arxiv.org/abs/2605.29548v2
Date: Mon, 01 Jun 2026 17:29:35 GMT
Status: Translation complete
Last updated in system: 2026-06-02 18:24:16.739104
Title: Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Title (reference translation): Why Do Larger Models Learn More? Effects of Capacity, Interference, and Rare-Task Retention
Authors: Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis, Naomi Saphra, David Alvarez-Melis, Andrew Kyle Lampinen, Christopher Potts, Ekdeep Singh Lubana
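Abstract summary: The paper shows that larger models learn tasks that smaller models cannot learn, even with infinite training data. In particular, smaller models allocate their neurons to high-frequency or low-complexity tasks and learn solutions that perform poorly on rare and complex tasks. The authors then assess how larger models circumvent this data-centric bottleneck and trace it to a reduced interference mechanism.
Reference score (independently computed attention score): 42.127946936876235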
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high frequency or low complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck, finding that it traces to a reduced interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, which means that they do not overwrite rare-task features as they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.
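Abstract (reference translation): Larger models learn tasks that smaller models cannot. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests a larger model can learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling in a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high-frequency or low-complexity tasks and learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. Larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, so they do not overwrite rare-task features as these slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results show that only the larger OLMo models learn the infrequent and complex tasks, and these models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to. This helps explain why larger models are better in practice.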

Related papers