AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training
- URL: http://arxiv.org/abs/2511.14721v1
- Date: Tue, 18 Nov 2025 18:08:20 GMT
- Title: AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training
- Authors: Fu-Ming Guo, Yingfang Fan
- Abstract summary: AdamHuberDecay is a drop-in replacement for AdamW that substitutes the $\ell_2$ penalty with a decoupled smooth Huber regularizer. Experiments on GPT-2 and GPT-3 pre-training demonstrate that AdamHuberDecay converges 10-15% faster in wall-clock time.
- Score: 0.2578242050187029
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. Yet the quadratic nature of the $\ell_2$ penalty embedded in weight decay drives all parameters toward the origin at the same rate, making the update vulnerable to rare but extreme gradient directions and often over-penalizing well-conditioned coordinates. We propose AdamHuberDecay, a drop-in replacement for AdamW that substitutes the $\ell_2$ penalty with a decoupled smooth Huber regularizer. The resulting update decays parameters quadratically while their magnitude remains below a threshold $\delta$, and linearly ($\ell_1$-like) once they exceed $\delta$, yielding (i) bounded regularization gradients, (ii) invariance to per-coordinate second-moment rescaling, and (iii) stronger sparsity pressure on overgrown weights. We derive the closed-form decoupled Huber decay step and show how to integrate it with any Adam-family optimizer at $O(1)$ extra cost. Extensive experiments on GPT-2 and GPT-3 pre-training demonstrate that AdamHuberDecay (a) converges 10-15% faster in wall-clock time, (b) reduces validation perplexity by up to 4 points, (c) delivers performance improvements of 2.5-4.7% across downstream tasks, and (d) yields visibly sparser weight histograms that translate into 20-30% memory savings after magnitude pruning, without tuning the decay coefficient beyond the default grid used for AdamW. Ablations confirm robustness to outlier gradients and large-batch regimes, together with theoretical analyses that bound the expected parameter norm under noisy updates. AdamHuberDecay therefore provides a simple, principled path toward more efficient and resilient training of next-generation foundational generative transformers.
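The decoupled Huber decay step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' reference implementation: the function names and the default hyperparameters (`delta=1.0`, `weight_decay=0.1`) are our assumptions. The key property shown is that the decay gradient equals $\theta$ inside the quadratic region ($|\theta| \le \delta$) and saturates at $\delta \cdot \mathrm{sign}(\theta)$ outside it, and that the decay is applied outside the second-moment rescaling, exactly as in AdamW's decoupling.

```python
import numpy as np

def huber_decay(theta, delta):
    """Gradient of the smooth Huber penalty: equals theta in the quadratic
    region (|theta| <= delta) and saturates at delta * sign(theta) beyond it,
    so the regularization gradient is bounded by delta."""
    return np.where(np.abs(theta) <= delta, theta, delta * np.sign(theta))

def adam_huber_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8, weight_decay=0.1, delta=1.0):
    """One Adam step with decoupled Huber decay (illustrative defaults)."""
    # Standard Adam moment estimates on the raw loss gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive update from the loss gradient
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    # Decoupled Huber decay: applied outside the second-moment rescaling,
    # so the decay direction is invariant to per-coordinate preconditioning
    theta = theta - lr * weight_decay * huber_decay(theta, delta)
    return theta, m, v
```

Because `huber_decay` is bounded, a single outlier coordinate can shrink by at most `lr * weight_decay * delta` per step, which is the mechanism behind the claimed robustness to extreme gradient directions.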
Related papers
- Fast Compute for ML Optimization [0.0]
We study optimization for losses that admit a variance-mean scale-mixture representation. The resulting Scale Mixture EM (SM-EM) algorithm removes user-specified learning-rate and momentum schedules. For the base (non-accelerated) algorithm, EM monotonicity guarantees nonincreasing objective values; adding Nesterov extrapolation trades this guarantee for faster empirical convergence.
arXiv Detail & Related papers (2026-02-15T19:09:58Z) - Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise [17.899443444882888]
We develop a worst-case complexity theory for stochastically preconditioned SGD (SPSGD). We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known, and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when problem parameters are unknown. In contrast, we prove that clipping may fail to converge in the worst case due to the statistical dependence between the preconditioner and the gradient estimates.
arXiv Detail & Related papers (2026-02-13T19:29:17Z) - Robust Layerwise Scaling Rules by Proper Weight Decay Tuning [50.11170157029911]
In modern scale-invariant architectures, training quickly enters a weight-decay-governed steady state. We introduce a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Our results extend $\mu$P beyond the near-init regime by explicitly controlling the steady-state scales set by the hyperparameters.
arXiv Detail & Related papers (2025-10-17T02:58:35Z) - VAMO: Efficient Zeroth-Order Variance Reduction for SGD with Faster Convergence [6.574641780732972]
Large-scale nonconvex problems are common in deep learning. First-order (FO) optimizers serve as today's baselines. Zeroth-order (ZO) algorithms reduce computation and memory costs. VAMO achieves these gains with smaller dynamic memory requirements.
arXiv Detail & Related papers (2025-05-20T05:31:15Z) - FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA [68.44043212834204]
Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in federated learning (FL).
arXiv Detail & Related papers (2025-05-19T07:32:56Z) - Decoupled Weight Decay for Any $p$ Norm [1.1510009152620668]
We consider a simple yet effective approach to sparsification, based on Bridge ($L_p$) regularization during training.
We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $p$ norm.
We empirically demonstrate that it leads to highly sparse networks, while maintaining performance comparable to standard $L_2$ regularization.
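A one-line sketch of the decoupled $L_p$ decay step this summary describes, under the assumption that the penalty is $\lambda\|\theta\|_p^p$ applied outside the adaptive update; the function name and the `eps` guard (which regularizes the singularity at zero when $p < 1$) are our additions, not the paper's:

```python
import numpy as np

def lp_decay_step(theta, lr, lam, p, eps=1e-12):
    """One decoupled L_p weight-decay step: theta -= lr * lam * d/dtheta |theta|^p.
    The derivative of |theta|^p is p * |theta|^(p-1) * sign(theta); eps guards
    the singularity at zero for p < 1."""
    return theta - lr * lam * p * np.sign(theta) * (np.abs(theta) + eps) ** (p - 1)
```

For $p = 2$ this reduces to standard decoupled weight decay (a multiplicative shrink of $\theta$), while for $p = 1$ every nonzero coordinate is shrunk by the same constant amount, which is what drives sparsity.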
arXiv Detail & Related papers (2024-04-16T18:02:15Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Training $\beta$-VAE by Aggregating a Learned Gaussian Posterior with a Decoupled Decoder [0.553073476964056]
Current practices in VAE training often result in a trade-off between the reconstruction fidelity and the continuity/disentanglement of the latent space.
We present intuitions and a careful analysis of the antagonistic mechanism of the two losses, and propose a simple yet effective two-stage method for training a VAE.
We evaluate the method using a medical dataset intended for 3D skull reconstruction and shape completion, and the results indicate promising generative capabilities of the VAE trained using the proposed method.
arXiv Detail & Related papers (2022-09-29T13:49:57Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of parameter and loss.
We design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof that Adam and Adagrad converge at rate $O(d\ln(N)/\sqrt{N})$.
Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate when used with its default parameters.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.