Fugu-MT Paper Translation (Abstract): Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

Paper overview: Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

arxiv url: http://arxiv.org/abs/2603.15958v1
Date: Mon, 16 Mar 2026 22:21:27 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.015296
Title: Deriving Hyperparameter Scaling Laws via Modern Optimization Theory
Authors: Egor Shulgin, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Bernhard Schölkopf, Antonio Orvieto
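Abstract summary: We study hyperparameter scaling laws for modern first-order optimizers through recent convergence bounds for methods based on the Linear Minimization Oracle (LMO). Treating bounds from the recent literature as a proxy and minimizing them across different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size. The results draw particular attention to the interaction between momentum and batch-size scaling, suggesting that optimal performance may be achieved with several scaling strategies.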
Reference score (independently computed attention score): 55.63126290312615
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon. Treating bounds in recent literature as a proxy and minimizing them across different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size as functions of the iteration or token budget. Our analysis, holding model size fixed, recovers most insights and observations from the literature under a unified and principled perspective, with clear directions open for future research. Our results draw particular attention to the interaction between momentum and batch-size scaling, suggesting that optimal performance may be achieved with several scaling strategies.
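Illustration (not from the paper): the abstract groups normalized SGD, signSGD, and Muon under the LMO framework. As a minimal sketch, assuming the standard definition of the LMO over a norm ball, lmo(g) = argmin over ||d|| <= radius of <g, d>, the Euclidean ball gives a normalized-gradient direction and the max-norm ball gives a sign direction. The function names `lmo_l2`, `lmo_linf`, and `lmo_step` are hypothetical, momentum is left out, and the spectral-norm case associated with Muon is omitted because it needs a matrix factorization.

```python
import math


def lmo_l2(grad, radius=1.0):
    """LMO over the Euclidean ball: argmin of <grad, d> subject to ||d||_2 <= radius."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # Every feasible point is optimal for a zero gradient; return zero.
        return [0.0 for _ in grad]
    return [-radius * g / norm for g in grad]


def lmo_linf(grad, radius=1.0):
    """LMO over the max-norm ball: argmin of <grad, d> subject to ||d||_inf <= radius."""
    def sign(x):
        return (x > 0) - (x < 0)
    return [-radius * sign(g) for g in grad]


def lmo_step(params, grad, lr, lmo):
    """One plain update x <- x + lr * lmo(grad), without momentum (an assumption)."""
    return [p + lr * d for p, d in zip(params, lmo(grad))]


grad = [3.0, -4.0]
print(lmo_l2(grad))    # normalized-SGD direction: [-0.6, 0.8]
print(lmo_linf(grad))  # signSGD direction: [-1.0, 1.0]
```

In both cases the step length is set by the learning rate and the ball radius rather than by the gradient magnitude, which is one reason the learning-rate schedule plays such a central role for this family of methods.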

Related papers