When do spectral gradient updates help in deep learning?
- URL: http://arxiv.org/abs/2512.04299v1
- Date: Wed, 03 Dec 2025 22:22:09 GMT
- Title: When do spectral gradient updates help in deep learning?
- Authors: Damek Davis, Dmitriy Drusvyatskiy
- Abstract summary: We propose a simple condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training.
- Score: 7.5757345574662205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low stable rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
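To make the layerwise condition concrete, here is a minimal NumPy sketch (not from the paper) that computes the two quantities the abstract compares for a single parameter block. The threshold direction shown, spectral favored when the gradient's squared nuclear-to-Frobenius ratio exceeds the activations' stable rank, is a reading of the abstract; the paper gives the precise statement.

```python
# Illustrative sketch (not the paper's code): compute, for one parameter block,
# the squared nuclear-to-Frobenius ratio of the gradient and the stable rank of
# the incoming activations, the two quantities the layerwise condition compares.
import numpy as np

def nuc_to_frob_sq(G):
    """(||G||_* / ||G||_F)^2 for a gradient block G."""
    s = np.linalg.svd(G, compute_uv=False)
    return float((s.sum() / np.sqrt((s ** 2).sum())) ** 2)

def stable_rank(A):
    """||A||_F^2 / ||A||_2^2 for an activation matrix A."""
    s = np.linalg.svd(A, compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))

rng = np.random.default_rng(0)
A = rng.normal(size=(1024, 16)) @ rng.normal(size=(16, 256))  # low-stable-rank activations
G = rng.normal(size=(256, 128))                               # a generic gradient block

ratio, srank = nuc_to_frob_sq(G), stable_rank(A)
# Reading of the condition: a spectral (Muon-style) step is predicted to beat a
# Euclidean step on this block when the ratio exceeds the stable rank.
print(f"nuclear/Frobenius^2 = {ratio:.1f}, stable rank = {srank:.1f}, spectral favored: {ratio > srank}")
```

In this toy, the product structure of A mimics the low stable rank of post-activation matrices, while a generic dense gradient block has a large nuclear-to-Frobenius ratio, so the condition comes out in favor of the spectral step.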
Related papers
- Closed-Form Last Layer Optimization [72.49151473937319]
Under a squared loss, the optimal linear last-layer weights are known in closed form. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer (a minimal illustrative sketch follows below).
arXiv Detail & Related papers (2025-10-06T09:14:39Z)
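A minimal sketch of the alternation described above, assuming a squared loss and a linear last layer; the tanh backbone, shapes, and ridge term are illustrative placeholders rather than the paper's setup.

```python
# Hedged sketch: a closed-form (ridge least-squares) solve for a linear last
# layer under squared loss, interleaved with Euclidean gradient steps on a toy
# one-hidden-layer "backbone". All names and shapes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))          # inputs
Y = rng.normal(size=(256, 4))           # regression targets
W1 = 0.1 * rng.normal(size=(32, 64))    # backbone weights
W2 = np.zeros((64, 4))                  # linear last layer
lr, ridge = 1e-2, 1e-3

for _ in range(100):
    H = np.tanh(X @ W1)                 # backbone features
    # Closed-form last layer under squared loss: W2 = (H^T H + ridge I)^{-1} H^T Y.
    W2 = np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ Y)
    # One gradient step on the backbone with W2 held at its closed-form optimum.
    R = H @ W2 - Y                      # residuals of 0.5 * ||H W2 - Y||_F^2
    dW1 = X.T @ ((R @ W2.T) * (1.0 - H ** 2))
    W1 -= lr * dW1
```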
- Gradient Equilibrium in Online Learning: Theory and Applications [56.02856551198923]
Gradient equilibrium is achieved by standard online learning methods. Gradient equilibrium translates into an interpretable and meaningful property in online prediction problems. We show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions (a rough paraphrase of the notion follows below).
arXiv Detail & Related papers (2025-01-14T18:59:09Z)
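As a rough paraphrase of the notion named above (a reading of the summary, not the paper's exact statement): a sequence of iterates is in gradient equilibrium when the time-averaged gradient vanishes.

```latex
% Paraphrase only; see the paper for the precise definition and conditions.
% Iterates \theta_1, \dots, \theta_T are in gradient equilibrium when
\[
  \frac{1}{T} \sum_{t=1}^{T} \nabla \ell_t(\theta_t) \;\to\; 0
  \qquad \text{as } T \to \infty .
\]
```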
- Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent [43.097493761380186]
Stochastic gradient algorithms are an efficient method for approximately solving linear systems. We show that gradient descent produces accurate predictions, even in cases where it does not converge quickly to the optimum. Experimentally, gradient descent achieves state-of-the-art performance on sufficiently large-scale or ill-conditioned regression tasks (a toy sketch of the underlying linear solve follows below).
arXiv Detail & Related papers (2023-06-20T15:07:37Z)
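A toy sketch of the linear-system view referenced above: the GP posterior mean at the training inputs needs the weights alpha solving (K + sigma^2 I) alpha = y, and plain gradient descent on the corresponding quadratic approximates that solve. The RBF kernel, noise level, and step size below are placeholders, not choices from the paper.

```python
# Illustrative sketch: gradient descent on the quadratic whose minimizer solves
# (K + sigma^2 I) alpha = y, the linear system behind a GP posterior mean.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

sq_dists = (X - X.T) ** 2                     # pairwise squared distances (1-D inputs)
K = np.exp(-0.5 * sq_dists)                   # RBF kernel matrix, lengthscale 1
sigma2 = 0.1 ** 2
A = K + sigma2 * np.eye(len(X))               # symmetric positive-definite system matrix

alpha = np.zeros(len(X))
step = 1.0 / np.linalg.norm(A, 2)             # safe step size from the spectral norm
for _ in range(2000):
    alpha -= step * (A @ alpha - y)           # gradient of 0.5*a^T A a - y^T a

posterior_mean = K @ alpha                    # GP posterior mean at the training inputs
print(np.linalg.norm(A @ alpha - y))          # residual of the linear solve
```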
- Why is parameter averaging beneficial in SGD? An objective smoothing perspective [13.863368438870562]
Stochastic gradient descent (SGD) and its implicit bias are often characterized in terms of the sharpness of the minima. We study the commonly used averaged SGD algorithm, whose benefits have been observed empirically by Izmailov et al. We prove that averaged SGD can efficiently optimize the smoothed objective, which avoids sharp local minima (an iterate-averaging sketch follows below).
arXiv Detail & Related papers (2023-02-18T16:29:06Z)
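A minimal sketch, not from the paper, of what "averaged SGD" refers to here: run SGD as usual but keep a running average of the iterates and use it as the output. The noisy quadratic objective and hyperparameters are placeholders.

```python
# Illustrative iterate-averaging sketch: ordinary SGD updates, plus a running
# average of the weights that is reported instead of the last iterate.
import numpy as np

rng = np.random.default_rng(0)
d, lr, T = 10, 0.1, 5000
w = rng.normal(size=d)
w_avg = np.zeros(d)

for t in range(1, T + 1):
    grad = w + 0.5 * rng.normal(size=d)          # noisy gradient of 0.5 * ||w||^2
    w -= lr * grad                               # plain SGD step
    w_avg += (w - w_avg) / t                     # running average of iterates

print(np.linalg.norm(w), np.linalg.norm(w_avg))  # the average sits much closer to 0
```

On this toy problem the averaged iterate ends up much closer to the minimizer than the last iterate, which is the variance-reduction effect iterate averaging is used for.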
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- GradViT: Gradient Inversion of Vision Transformers [83.54779732309653]
We demonstrate the vulnerability of vision transformers (ViTs) to gradient-based inversion attacks.
We introduce a method, named GradViT, that optimizes random noise into natural-looking images.
We observe unprecedentedly high fidelity and closeness to the original (hidden) data.
arXiv Detail & Related papers (2022-03-22T17:06:07Z)
- Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models? [46.01087792062936]
We compare classes of estimators via the relative performance of the best method in the class.
This allows us to rigorously quantify the tuning sensitivity of learning algorithms.
arXiv Detail & Related papers (2021-08-26T16:01:37Z)
- Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum is effective on vision tasks, and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
- Decreasing scaling transition from adaptive gradient descent to stochastic gradient descent [1.7874193862154875]
We propose DSTAda, a method that applies a decreasing scaling transition from adaptive gradient descent to stochastic gradient descent.
Our experimental results show that DSTAda converges faster and attains higher accuracy, with better stability and robustness.
arXiv Detail & Related papers (2021-06-12T11:28:58Z)
- Reparametrizing gradient descent [0.0]
We propose an optimization algorithm which we call norm-adapted gradient descent.
Our algorithm can also be compared to quasi-Newton methods, but we seek roots rather than stationary points.
arXiv Detail & Related papers (2020-10-09T20:22:29Z)
- Gravilon: Applications of a New Gradient Descent Method to Machine Learning [0.5352699766206809]
We provide a novel gradient descent algorithm, called Gravilon, that uses the geometry of the hypersurface to modify the length of the step in the direction of the gradient.
Using neural networks, we provide promising experimental results comparing the accuracy and efficiency of Gravilon against commonly used gradient descent algorithms on MNIST digit classification.
arXiv Detail & Related papers (2020-08-26T04:02:02Z)
- Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.