Fugu-MT Paper Translation (Abstract): Language models scale reliably with over-training and on downstream tasks

Paper summary: Language models scale reliably with over-training and on downstream tasks

arxiv url: http://arxiv.org/abs/2403.08540v2
Date: Fri, 14 Jun 2024 20:21:05 GMT
Status: Translation complete
Last updated in system: 2024-06-19 05:27:06.212955
Title: Language models scale reliably with over-training and on downstream tasks
Authors: Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
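Abstract summary: Scaling laws are useful guides for derisking expensive training runs. However, gaps remain between current scaling studies and how language models are actually trained. Moreover, scaling laws mostly predict next-token prediction loss, whereas models are usually compared on downstream task performance.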
Reference score (attention score computed independently by the site): 121.69867718185125
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32$\times$ over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run), each from experiments that take 300$\times$ less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20$\times$ less compute. Our experiments are available at https://github.com/mlfoundations/scaling.
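The over-training figures quoted in the abstract can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper's code. It assumes the common rule of thumb that compute-optimal ("Chinchilla optimal") training uses roughly 20 tokens per parameter, and the widely used approximation that training compute is about 6 x parameters x tokens. Neither assumption is stated in the abstract, and the function names are hypothetical.

```python
# Illustrative arithmetic only. Two assumptions not stated in the abstract:
#   1. compute-optimal training uses roughly 20 tokens per parameter;
#   2. training compute is approximately 6 * parameters * tokens (FLOPs).

ASSUMED_OPTIMAL_TOKENS_PER_PARAM = 20.0


def token_multiplier(params: float, tokens: float) -> float:
    """Tokens seen per model parameter."""
    return tokens / params


def over_training_factor(
    params: float,
    tokens: float,
    optimal_ratio: float = ASSUMED_OPTIMAL_TOKENS_PER_PARAM,
) -> float:
    """How many times more tokens per parameter than the assumed optimum."""
    return token_multiplier(params, tokens) / optimal_ratio


def approx_train_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute estimate (assumption 2 above)."""
    return 6.0 * params * tokens


# The two runs named in the abstract: (parameters, training tokens).
runs = {
    "1.4B parameters, 900B tokens": (1.4e9, 900e9),
    "6.9B parameters, 138B tokens": (6.9e9, 138e9),
}

for name, (n_params, n_tokens) in runs.items():
    print(
        f"{name}: "
        f"{token_multiplier(n_params, n_tokens):.1f} tokens/param, "
        f"{over_training_factor(n_params, n_tokens):.1f}x the assumed optimum, "
        f"about {approx_train_flops(n_params, n_tokens):.2e} FLOPs"
    )
```

Under these assumptions the first run comes out at roughly 32 times the assumed optimal token budget and the second at roughly 1 times, matching the "32$\times$ over-trained" and "compute-optimal" labels in the abstract.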
