Fugu-MT Paper Translation (Abstract): AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training

Paper abstract: AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training

arxiv url: http://arxiv.org/abs/2511.14721v1
Date: Tue, 18 Nov 2025 18:08:20 GMT
Status: Translation complete
Last updated in system: 2025-11-19 16:23:53.25353
Title: AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training
Title (reference translation): AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training
Authors: Fu-Ming Guo, Yingfang Fan
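Abstract summary: AdamHuberDecay is a drop-in replacement for AdamW that replaces the $\ell_2$ penalty with a decoupled smooth Huber regularizer. Pre-training experiments on GPT-2 and GPT-3 show that AdamHuberDecay converges 10-15% faster in wall-clock time.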
Reference score (independently computed attention score): 0.2578242050187029
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. Yet the quadratic nature of the $\ell_2$ penalty embedded in weight decay drives all parameters toward the origin at the same rate, making the update vulnerable to rare but extreme gradient directions and often over-penalizing well-conditioned coordinates. We propose AdamHuberDecay, a drop-in replacement for AdamW that substitutes the $\ell_2$ penalty with a decoupled smooth Huber regularizer. The resulting update decays parameters quadratically while their magnitude remains below a threshold $δ$, and linearly ($\ell_1$-like) once they exceed $δ$, yielding (i) bounded regularization gradients, (ii) invariance to per-coordinate second-moment rescaling, and (iii) stronger sparsity pressure on overgrown weights. We derive the closed-form decoupled Huber decay step and show how to integrate it with any Adam-family optimizer at $O(1)$ extra cost. Extensive experiments on GPT-2 and GPT-3 pre-training demonstrate that AdamHuberDecay (a) converges 10-15% faster in wall-clock time, (b) reduces validation perplexity by up to 4 points, (c) delivers performance improvements of 2.5-4.7% across downstream tasks, and (d) yields visibly sparser weight histograms that translate into 20-30% memory savings after magnitude pruning, without tuning the decay coefficient beyond the default grid used for AdamW. Ablations confirm robustness to outlier gradients and large-batch regimes, together with theoretical analyses that bound the expected parameter norm under noisy updates. AdamHuberDecay therefore provides a simple, principled path toward more efficient and resilient training of next-generation foundational generative transformers.
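Abstract (reference translation): Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. However, the quadratic nature of the $\ell_2$ penalty built into weight decay drives all parameters toward the origin at the same rate, which leaves the update vulnerable to rare but extreme gradient directions and often over-penalizes well-conditioned coordinates. We propose AdamHuberDecay, a drop-in replacement for AdamW that replaces the $\ell_2$ penalty with a decoupled smooth Huber regularizer. The resulting update decays parameters quadratically while their magnitude is below a threshold $δ$ and linearly ($\ell_1$-like) once they exceed $δ$, giving (i) bounded regularization gradients, (ii) invariance to per-coordinate second-moment rescaling, and (iii) stronger sparsity pressure on overgrown weights. We derive the closed-form decoupled Huber decay step and show how to integrate it with any Adam-family optimizer at $O(1)$ extra cost. Extensive experiments on GPT-2 and GPT-3 pre-training show that AdamHuberDecay (a) converges 10-15% faster in wall-clock time, (b) reduces validation perplexity by up to 4 points, (c) improves performance by 2.5-4.7% across downstream tasks, and (d) yields visibly sparser weight histograms that translate into 20-30% memory savings after magnitude pruning, without tuning the decay coefficient beyond the default grid used for AdamW. Ablations confirm robustness to outlier gradients and large-batch regimes, alongside theoretical analyses that bound the expected parameter norm under noisy updates. AdamHuberDecay therefore offers a simple, principled path toward more efficient and resilient training of next-generation foundational generative transformers.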
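
The decay rule described in the abstract can be illustrated with a small sketch. The abstract does not reproduce the paper's closed-form step, so the code below is only an assumption: it uses the gradient of the standard Huber function, which equals the parameter itself while its magnitude is at most `delta` and `delta * sign(theta)` beyond that, and applies it separately from the Adam update in the same way AdamW applies its decoupled decay. The function names (`huber_decay_grad`, `adam_huber_decay_step`) and the default hyperparameters are hypothetical and chosen for illustration.

```python
import math


def huber_decay_grad(theta: float, delta: float) -> float:
    """Gradient of the standard Huber function (assumed form of the regularizer).

    Equals theta for |theta| <= delta (quadratic zone) and
    delta * sign(theta) otherwise (linear zone), i.e. a clip to [-delta, delta].
    """
    return max(-delta, min(delta, theta))


def adam_huber_decay_step(params, grads, m, v, t, *, lr=1e-3,
                          betas=(0.9, 0.999), eps=1e-8,
                          weight_decay=0.01, delta=1.0):
    """One Adam step with a decoupled Huber decay term (illustrative sketch).

    params, grads, m, v are lists of floats of equal length; m and v are
    updated in place; t is the 1-based step count. Returns the new params.
    """
    beta1, beta2 = betas
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Decoupled decay: applied directly to the weight and never divided
        # by the second-moment estimate, so it is bounded by lr * wd * delta.
        p = p - lr * weight_decay * huber_decay_grad(p, delta)

        # Standard Adam moment updates with bias correction.
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        m_hat = m[i] / (1.0 - beta1 ** t)
        v_hat = v[i] / (1.0 - beta2 ** t)

        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


if __name__ == "__main__":
    # With zero gradients only the decay acts: the small weight shrinks
    # proportionally, the large weights shrink by a fixed amount.
    weights = [0.5, 3.0, -3.0]
    state_m, state_v = [0.0] * 3, [0.0] * 3
    print(adam_huber_decay_step(weights, [0.0] * 3, state_m, state_v, t=1,
                                lr=0.1, weight_decay=0.1, delta=1.0))
```

Under this assumed form, the decay on a weight never exceeds `lr * weight_decay * delta` per step, which is one way to read the "bounded regularization gradients" property, and because the decay bypasses the division by the second-moment estimate it is unaffected by per-coordinate rescaling of that estimate.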

Related papers