Fugu-MT Paper Translation (Abstract): Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Paper overview: Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

arxiv url: http://arxiv.org/abs/2509.19189v3
Date: Mon, 03 Nov 2025 13:29:04 GMT
Status: Translation complete
Last updated in system: 2025-11-04 20:19:58.507314
Title: Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules
Title (reference translation): Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules
Authors: Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu
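Abstract summary: Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models. We establish a Functional Scaling Law that captures the full loss trajectory under arbitrary learning rate schedules (LRSs). We derive explicit scaling relations in both data-limited and compute-limited regimes.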
Reference score (independently computed attention score): 9.332823269318842
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
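Abstract (reference translation): Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). Existing studies, however, focus mainly on the final-step loss, leaving open whether the entire loss dynamics follow similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures training progress more faithfully than the iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs (constant, exponential decay, and warmup-stable-decay (WSD)) and derive explicit scaling relations in both data-limited and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data-efficient and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.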
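For readers unfamiliar with the three schedule families named in the abstract, the following is a minimal Python sketch of constant, exponential-decay, and warmup-stable-decay (WSD) learning rate schedules. It is not taken from the paper: the function names, the parameters (`final_ratio`, `warmup_frac`, `decay_frac`), the linear shape of the WSD warmup and decay phases, and the helper `cumulative_lr` are all assumptions made for illustration. In particular, the running sum of learning rates is shown only as one plausible way to measure progress other than iteration count; the abstract does not define intrinsic time, so this should not be read as the paper's definition.

```python
def constant_lr(step, total_steps, peak_lr):
    """Constant schedule: the same learning rate at every step."""
    return peak_lr


def exponential_decay_lr(step, total_steps, peak_lr, final_ratio=0.1):
    """Geometric decay from peak_lr (step 0) to peak_lr * final_ratio (last step).

    final_ratio is a hypothetical parameter chosen for this sketch.
    """
    frac = step / max(total_steps - 1, 1)
    return peak_lr * final_ratio ** frac


def wsd_lr(step, total_steps, peak_lr,
           warmup_frac=0.1, decay_frac=0.2, final_ratio=0.1):
    """Warmup-stable-decay: linear warmup, flat plateau, then linear decay.

    The linear shapes and the default fractions are assumptions of this sketch.
    """
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup phase: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: ramp linearly down to peak_lr * final_ratio.
    frac = (step - decay_start + 1) / decay_steps
    return peak_lr * (1.0 - (1.0 - final_ratio) * frac)


def cumulative_lr(schedule, total_steps, peak_lr):
    """Running sum of learning rates over training (illustrative proxy only)."""
    total = 0.0
    sums = []
    for step in range(total_steps):
        total += schedule(step, total_steps, peak_lr)
        sums.append(total)
    return sums


if __name__ == "__main__":
    for name, schedule in [("constant", constant_lr),
                           ("exp-decay", exponential_decay_lr),
                           ("wsd", wsd_lr)]:
        sums = cumulative_lr(schedule, total_steps=100, peak_lr=1.0)
        print(f"{name}: cumulative learning rate after 100 steps = {sums[-1]:.3f}")
```

Under these assumptions, two schedules with the same number of iterations can accumulate different total learning rates, which is one simple way to see why iteration count alone may not describe training progress.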

