Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent
- URL: http://arxiv.org/abs/2602.03001v1
- Date: Tue, 03 Feb 2026 02:16:55 GMT
- Title: Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent
- Authors: Hiroki Naganuma, Shagun Gupta, Youssef Briki, Ioannis Mitliagkas, Irina Rish, Parameswaran Raman, Hao-Jun Michael Shi
- Abstract summary: Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. We derive gradient noise metrics for signSGD and specSGD that naturally emerge from the geometry of their respective dual norms. Our experiments demonstrate that adaptive batch size strategies using non-Euclidean norms enable us to match the validation loss of constant-batch baselines while reducing training steps by up to 66% for Signum and Muon.
- Score: 21.698853170807684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To maximize hardware utilization, modern machine learning systems typically employ large constant or manually tuned batch size schedules, relying on heuristics that are brittle and costly to tune. Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. However, their assumption of SGD's Euclidean geometry creates a fundamental mismatch with popular optimizers based on generalized norms, such as signSGD / Signum ($\ell_\infty$) and stochastic spectral descent (specSGD) / Muon ($\mathcal{S}_\infty$). In this work, we derive gradient noise scales for signSGD and specSGD that naturally emerge from the geometry of their respective dual norms. To practically estimate these non-Euclidean metrics, we propose an efficient variance estimation procedure that leverages the local mini-batch gradients on different ranks in distributed data-parallel systems. Our experiments demonstrate that adaptive batch size strategies using non-Euclidean GNS enable us to match the validation loss of constant-batch baselines while reducing training steps by up to 66% for Signum and Muon on a 160 million parameter Llama model.
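As a concrete illustration of the idea, the following minimal Python/NumPy sketch estimates the classical Euclidean gradient noise scale from the per-rank gradients that are already available in a data-parallel step, and applies a simple batch-size adaptation rule. This is only a sketch: the paper's contribution is to replace these Euclidean quantities with metrics matched to the dual-norm geometries of signSGD/Signum ($\ell_\infty$) and specSGD/Muon ($\mathcal{S}_\infty$), which are not reproduced here; the function names, thresholds, and adaptation rule below are illustrative assumptions, not the authors' estimator.

```python
import numpy as np

def euclidean_gns(per_rank_grads, micro_batch_size):
    """Estimate the Euclidean gradient noise scale B_noise = tr(Sigma) / ||G||^2
    from the local gradients already computed on each data-parallel rank.

    per_rank_grads : array of shape (num_ranks, num_params); row r is the
        gradient of rank r's micro-batch of `micro_batch_size` examples.
    """
    num_ranks = per_rank_grads.shape[0]
    g_bar = per_rank_grads.mean(axis=0)  # global mini-batch gradient
    # The variance across ranks estimates Sigma / micro_batch_size per coordinate.
    trace_sigma = micro_batch_size * per_rank_grads.var(axis=0, ddof=1).sum()
    # ||g_bar||^2 overestimates ||G||^2 by tr(Sigma) / global_batch; correct for it.
    global_batch = num_ranks * micro_batch_size
    signal = max(float(g_bar @ g_bar) - trace_sigma / global_batch, 1e-12)
    return trace_sigma / signal

def adapt_batch_size(batch_size, gns, low=0.5, high=2.0, factor=2, max_bs=4096):
    """Toy adaptation rule (not the paper's schedule): grow the batch when it is
    small relative to the noise scale, shrink it when it is much larger."""
    if batch_size < low * gns:
        return min(batch_size * factor, max_bs)
    if batch_size > high * gns:
        return max(batch_size // factor, 1)
    return batch_size

# Synthetic check: 8 ranks, 1000 parameters, micro-batch of 32 examples per rank.
rng = np.random.default_rng(0)
true_grad = rng.standard_normal(1000)
per_rank = true_grad + rng.standard_normal((8, 1000)) / np.sqrt(32)  # noise ~ Sigma / 32
gns = euclidean_gns(per_rank, micro_batch_size=32)
print(gns, adapt_batch_size(256, gns))
```

The bias correction on the signal term follows the standard "simple noise scale" estimator; a dual-norm variant along the lines of the paper would replace both tr(Sigma) and the squared Euclidean norm with the corresponding non-Euclidean quantities.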
Related papers
- Adaptive Stepsizing for Stochastic Gradient Langevin Dynamics in Bayesian Neural Networks [3.0102563923286856]
We introduce SA-SGLD, which employs time rescaling to modulate the stepsize according to a monitored quantity. We show that our method can achieve more accurate posterior sampling than SGLD on high-curvature 2D toy examples and in image classification with BNNs using sharp priors.
arXiv Detail & Related papers (2025-11-11T13:15:17Z) - Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem. We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
arXiv Detail & Related papers (2025-02-11T13:10:34Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Resampling Stochastic Gradient Descent Cheaply for Efficient Uncertainty
Quantification [6.832893024035705]
We investigate two computationally cheap resampling-based methods to construct confidence intervals for SGD solutions.
One uses multiple, but few, SGDs in parallel via resampling with replacement from the data, and another operates this in an online fashion.
arXiv Detail & Related papers (2023-10-17T08:18:10Z) - Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of parameters and loss.
We design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z) - Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z) - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to
Improve Generalization [89.7882166459412]
Stochastic gradient noise (SGN) acts as implicit regularization for deep learning.
Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning.
For simulating SGN at low computational costs and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
arXiv Detail & Related papers (2021-03-31T16:08:06Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - S-SGD: Symmetrical Stochastic Gradient Descent with Weight Noise
Injection for Reaching Flat Minima [22.46916792590578]
The stochastic gradient descent (SGD) method is the most widely used for deep neural network (DNN) training.
Weight noise injection has been extensively studied for finding flat minima using the SGD method.
We devise a new weight-noise injection-based SGD method that adds symmetrical noises to the weights.
arXiv Detail & Related papers (2020-09-05T07:02:02Z) - On the Promise of the Stochastic Generalized Gauss-Newton Method for
Training DNNs [37.96456928567548]
We study a stochastic generalized Gauss-Newton (SGN) method for training DNNs.
SGN is a second-order optimization method with efficient iterations, which we demonstrate often requires substantially fewer iterations than standard SGD to converge.
We show that SGN does not only substantially improve over SGD in terms of the number of iterations, but also in terms of runtime.
This is made possible by an efficient, easy-to-use and flexible implementation of SGN we propose in the Theano deep learning platform.
arXiv Detail & Related papers (2020-06-03T17:35:54Z)