Fugu-MT Paper Translation (Abstract): Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law

Paper overview: Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law

arxiv url: http://arxiv.org/abs/2505.19227v1
Date: Sun, 25 May 2025 16:43:51 GMT
Status: Translation complete
Last updated in system: 2025-05-27 16:58:42.990982
Title: Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law
Title (reference translation): Scaling laws for gradient descent and sign descent in linear bigram models under Zipf's law
Authors: Frederik Kunstner, Francis Bach
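Abstract summary: Recent works have highlighted the difficulties that gradient descent faces when training the first and last layers of transformer-based language models. These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data. The paper shows that the problem is more difficult when the data have heavier tails.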
Reference score (independently computed attention score): 4.6193503399184275
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent works have highlighted optimization difficulties faced by gradient descent in training the first and last layers of transformer-based language models, which are overcome by optimizers such as Adam. These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data, where the frequency of the $k$th most frequent word $\pi_k$ is proportional to $1/k$, following Zipf's law. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when the tokens follow a power law $\pi_k \propto 1/k^\alpha$ parameterized by the exponent $\alpha > 0$. We derive optimization scaling laws for deterministic gradient descent and sign descent as a proxy for Adam as a function of the exponent $\alpha$. Existing theoretical investigations in scaling laws assume that the eigenvalues of the data decay as a power law with exponent $\alpha > 1$. This assumption effectively makes the problem ``finite dimensional'' as most of the loss comes from a few of the largest eigencomponents. In comparison, we show that the problem is more difficult when the data have heavier tails. The case $\alpha = 1$ as found in text data is ``worst-case'' for gradient descent, in that the number of iterations required to reach a small relative error scales almost linearly with dimension. While the performance of sign descent also depends on the dimension, for Zipf-distributed data the number of iterations scales only with the square-root of the dimension, leading to a large improvement for large vocabularies.
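Abstract (reference translation): Recent works have highlighted the optimization difficulties that gradient descent faces when training the first and last layers of transformer-based language models, difficulties that optimizers such as Adam overcome. These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data, where the frequency $\pi_k$ of the $k$th most frequent word is proportional to $1/k$, following Zipf's law. To better understand the impact of the data distribution on training performance, the authors study a linear bigram model for next-token prediction when the tokens follow a power law $\pi_k \propto 1/k^\alpha$ parameterized by the exponent $\alpha > 0$. They derive optimization scaling laws, as a function of the exponent $\alpha$, for deterministic gradient descent and for sign descent as a proxy for Adam. Existing theoretical studies of scaling laws assume that the eigenvalues of the data decay as a power law with exponent $\alpha > 1$. This assumption effectively makes the problem ``finite dimensional'', because most of the loss comes from a few of the largest eigencomponents. In comparison, the problem is more difficult when the data have heavier tails. The case $\alpha = 1$, as found in text data, is ``worst-case'' for gradient descent: the number of iterations required to reach a small relative error scales almost linearly with the dimension. The performance of sign descent also depends on the dimension, but for Zipf-distributed data the number of iterations scales only with the square root of the dimension, which is a large improvement for large vocabularies.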
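
As a rough illustration of why the exponent $\alpha$ matters, the sketch below computes normalized power-law frequencies $\pi_k \propto 1/k^\alpha$ and reports how much of the total mass sits in the ten most frequent tokens. This is not code from the paper: the function names `zipf_frequencies` and `top_mass` are hypothetical, and using frequency mass as a stand-in for "where the loss comes from" is a simplifying assumption made only for this example.

```python
import numpy as np


def zipf_frequencies(d, alpha):
    """Normalized power-law frequencies pi_k proportional to 1/k^alpha, k = 1..d."""
    k = np.arange(1, d + 1, dtype=np.float64)
    weights = k ** (-alpha)
    return weights / weights.sum()


def top_mass(d, alpha, top=10):
    """Fraction of the total mass held by the `top` most frequent tokens."""
    return float(zipf_frequencies(d, alpha)[:top].sum())


if __name__ == "__main__":
    # Illustrative vocabulary sizes and exponents (assumed, not from the paper).
    for alpha in (1.0, 2.0):
        for d in (10**3, 10**5):
            print(f"alpha={alpha}, d={d}: top-10 mass = {top_mass(d, alpha):.3f}")
```

For $\alpha = 2$ the top-10 share barely changes as the vocabulary size `d` grows, because the normalizing sum converges. For $\alpha = 1$ the normalizing sum grows roughly like $\log d$, so the share keeps shrinking as `d` grows and the tail cannot be ignored.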
