Fugu-MT Paper Translation (Abstract): Parallel Scaling Law for Language Models

Paper summary: Parallel Scaling Law for Language Models

arxiv url: http://arxiv.org/abs/2505.10475v1
Date: Thu, 15 May 2025 16:24:45 GMT
Status: Translation complete
Last updated in system: 2025-05-16 22:29:06.417526
Title: Parallel Scaling Law for Language Models
Title (reference translation): Parallel Scaling Law for Language Models
Authors: Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu
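Abstract summary: We introduce a third, more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference. We theoretically propose a new scaling law and validate it through large-scale pre-training, showing that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while offering superior inference efficiency.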
Reference score (independently computed attention score): 45.799251718923614
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.
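Abstract (reference translation): It is commonly believed that scaling language models must incur a significant space or time cost, by increasing either the parameters (parameter scaling) or the output tokens (inference-time scaling). We introduce a third, more inference-efficient scaling paradigm: increasing the model's parallel computation at both training and inference time. We apply $P$ diverse, learnable transformations to the input, run the model's forward passes in parallel, and dynamically aggregate the $P$ outputs. This method, parallel scaling (ParScale), scales parallel computation by reusing existing parameters, and it can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, showing that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while offering superior inference efficiency. For example, compared with parameter scaling that achieves the same performance improvement, ParScale can incur up to 22$\times$ less memory increase and 6$\times$ less latency increase. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small number of tokens, further reducing the training budget. The new scaling law we discovered could facilitate the deployment of more powerful models in low-resource scenarios, and it offers an alternative perspective on the role of computation in machine learning.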
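
The three steps named in the abstract (transform the input $P$ ways, run the shared model on every stream, dynamically aggregate the $P$ outputs) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the class name `ParallelScaledModel`, the one-layer `tanh` backbone, the additive per-stream shifts, and the linear softmax scorer are all assumptions chosen only to keep the example small.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ParallelScaledModel:
    """Toy wrapper: P input transforms, one shared backbone, learned aggregation.

    Every component here is a hypothetical stand-in, not the paper's architecture.
    """

    def __init__(self, dim, num_streams, seed=0):
        rng = np.random.default_rng(seed)
        self.num_streams = num_streams
        # Shared backbone weights, reused by every stream (assumed one-layer MLP).
        self.w = rng.normal(scale=dim ** -0.5, size=(dim, dim))
        # One learnable transformation per stream (assumed: an additive shift).
        self.shifts = rng.normal(scale=0.1, size=(num_streams, dim))
        # Scorer used to weight each stream's output (assumed: a linear head).
        self.score = rng.normal(scale=dim ** -0.5, size=(dim,))

    def backbone(self, x):
        return np.tanh(x @ self.w)

    def forward(self, x):
        # x: (batch, dim) -> streams: (P, batch, dim), one copy per transformation.
        streams = x[None, :, :] + self.shifts[:, None, :]
        # A single batched pass evaluates all P streams with the same weights.
        outputs = self.backbone(streams)
        # Dynamic aggregation: softmax over the stream axis gives per-example weights.
        weights = softmax(outputs @ self.score, axis=0)            # (P, batch)
        aggregated = (weights[..., None] * outputs).sum(axis=0)    # (batch, dim)
        return aggregated, weights


model = ParallelScaledModel(dim=8, num_streams=4)
y, w = model.forward(np.ones((3, 8)))
print(y.shape, w.shape)
```

In this sketch the backbone holds `dim * dim` weights, while the stream-specific additions hold only `num_streams * dim + dim`, which mirrors the abstract's point that parallel computation grows by reusing existing parameters rather than adding many new ones.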

Related papers