Iterative Orthogonalization Scaling Laws
- URL: http://arxiv.org/abs/2505.04005v2
- Date: Thu, 08 May 2025 03:03:40 GMT
- Title: Iterative Orthogonalization Scaling Laws
- Authors: Devan Selvaraj
- Abstract summary: The Muon optimizer has recently attracted much attention as a possible replacement for the seemingly omnipresent Adam optimizer. At larger scales, Muon's iterative orthogonalization procedure may run into trouble as the singular values of random matrices shrink with scale; this paper shows this behavior theoretically and empirically on random matrices but does not suggest what to do about it.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Muon optimizer has recently attracted much attention as a possible replacement for the seemingly omnipresent Adam optimizer. Care has since been taken to document the scaling laws of hyper-parameters under Muon, such as weight decay and learning rate. However, at much larger scales the iterative orthogonalization procedure present in Muon may suffer an issue, as the singular values of random matrices shrink with scale. This paper shows this scaling behavior theoretically and empirically on random matrices but does not suggest what to do about it.
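To make the effect concrete, here is a minimal NumPy sketch (mine, not the paper's code), assuming the cubic Newton-Schulz variant of the orthogonalization used in Muon-style optimizers: with a fixed iteration budget, the ratio of smallest to largest singular value shrinks as the matrix grows, and the residual from exact orthogonality grows with it.

```python
import numpy as np

def newton_schulz(G, steps=10):
    """Cubic Newton-Schulz iteration toward the orthogonal polar factor of G."""
    X = G / np.linalg.norm(G)        # Frobenius normalization => spectral norm <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
for n in (64, 256, 1024):
    G = rng.standard_normal((n, n))
    s = np.linalg.svd(G, compute_uv=False)
    X = newton_schulz(G)
    # Residual from exact orthogonality after a fixed iteration budget.
    err = np.linalg.norm(X.T @ X - np.eye(n)) / np.sqrt(n)
    print(f"n={n:5d}  sigma_min/sigma_max={s[-1] / s[0]:.2e}  ortho_residual={err:.2e}")
```

Frobenius normalization makes every singular value small relative to 1 as n grows, and under the cubic map 1.5x - 0.5x^3 the smallest ones grow only geometrically per step, which is why a fixed budget falls behind at scale.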
Related papers
- Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers [11.445970271488095]
We introduce learnable multipliers to learn the optimal scale for applying weight decay to matrix layers.
Our method can be viewed as a learnable, more expressive generalization of muP multipliers.
It outperforms a well-tuned muP baseline, reduces the computational overhead of tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers.
arXiv Detail & Related papers (2026-01-08T12:41:49Z)
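As a rough illustration of the idea (the parameterization and initialization below are my assumptions, not the paper's), a matrix layer can carry a trainable scalar multiplier on its output, so the effective scale is learned rather than fixed by a muP prescription:

```python
import torch
import torch.nn as nn

class MultipliedLinear(nn.Module):
    """Linear layer with a trainable output multiplier, a learnable
    stand-in for a fixed muP-style constant."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out, bias=False)
        # Log-space parameterization keeps the learned scale positive;
        # initialized at a muP-like 1/d_in (an assumption for illustration).
        self.log_mult = nn.Parameter(torch.log(torch.tensor(1.0 / d_in)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.exp(self.log_mult) * self.linear(x)
```

Decoupling the scale this way also changes what weight decay on `self.linear.weight` means, which is presumably where the interaction with weight decay noted in the summary enters.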
- Dion2: A Simple Method to Shrink Matrix in Muon [19.766325230655173]
We introduce Dion2, a much simpler method for shrinking the matrix involved in Muon's iteration compared to prior approaches.
At a high level, Dion2 selects a fraction of rows or columns at each step and orthonormalizes only those.
arXiv Detail & Related papers (2025-12-01T16:58:10Z)
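A rough NumPy sketch of the stated mechanism, assuming "orthonormalize" means replacing the selected rows with an orthonormal set obtained by QR; Dion2's actual selection rule and update may differ:

```python
import numpy as np

def subset_orthonormalize(G, frac=0.25, rng=None):
    """Orthonormalize a random fraction of G's rows, leaving the rest as-is."""
    rng = rng or np.random.default_rng()
    m, n = G.shape
    k = max(1, int(frac * m))
    idx = rng.choice(m, size=k, replace=False)   # rows chosen this step
    Q, _ = np.linalg.qr(G[idx].T)                # n x k, orthonormal columns
    out = G.copy()
    out[idx] = Q.T                               # the k chosen rows, now orthonormal
    return out, idx

G = np.random.default_rng(1).standard_normal((8, 16))
H, idx = subset_orthonormalize(G, frac=0.5)
print(np.allclose(H[idx] @ H[idx].T, np.eye(len(idx))))  # True
```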
- AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates [5.049533819651459]
We propose a new adaptive update, AdaGO, which combines a norm-based update with an AdaGrad-type stepsize.
AdaGO preserves the orthogonality of the update, which can be interpreted as spectral descent, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradients.
arXiv Detail & Related papers (2025-09-03T03:42:22Z)
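A minimal sketch of the described combination (my reading, not the official algorithm), assuming the orthogonal direction comes from an exact SVD and the accumulator sums squared gradient Frobenius norms:

```python
import numpy as np

def adago_step(W, G, state, lr=0.1, eps=1e-8):
    """One hypothetical AdaGO-style step: orthogonal (spectral) direction,
    AdaGrad-style stepsize from accumulated past gradient norms."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    direction = U @ Vt                          # nearest orthogonal factor of G
    state["acc"] += np.linalg.norm(G) ** 2      # accumulate squared gradient norm
    W -= lr / np.sqrt(state["acc"] + eps) * direction
    return W

state = {"acc": 0.0}
W = np.eye(4)
G = np.random.default_rng(2).standard_normal((4, 4))
W = adago_step(W, G, state)
```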
- Improving Adaptive Moment Optimization via Preconditioner Diagonalization [11.01832755213396]
We show that our approach can substantially enhance the convergence speed of modern adaptive optimizers.
For large language models like LLaMA, we can achieve a speedup of 2x compared to the baseline Adam.
arXiv Detail & Related papers (2025-02-11T11:48:04Z)
- On the phase diagram of extensive-rank symmetric matrix denoising beyond rotational invariance [5.058205542605482]
We make progress towards understanding matrix denoising when the signal is a factored matrix $XX^\intercal$ that is not rotationally invariant.
We argue that it is only beyond the transition that factorisation, i.e., estimating $X$ itself, becomes possible up to irresolvable ambiguities.
arXiv Detail & Related papers (2024-11-04T10:50:37Z)
- Scaling and renormalization in high-dimensional regression [72.59731158970894]
We present a unifying perspective on recent results on ridge regression.
We use the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning.
Our results extend earlier models of scaling laws and place them in a common framework.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
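For orientation, the central objects in such analyses are the ridge estimator and its resolvent; the notation below is mine, and the paper's precise setup may differ:

```latex
% The ridge estimator and the resolvent whose normalized trace
% random matrix theory and free probability control.
\hat{\beta}_\lambda = \bigl(X^\top X + \lambda I\bigr)^{-1} X^\top y,
\qquad
G(\lambda) = \Bigl(\tfrac{1}{n} X^\top X + \lambda I\Bigr)^{-1}.
```

Train and test risks reduce to traces of $G(\lambda)$, which these tools evaluate in the high-dimensional limit.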
- Probabilistic Unrolling: Scalable, Inverse-Free Maximum Likelihood Estimation for Latent Gaussian Models [69.22568644711113]
We introduce probabilistic unrolling, a method that combines Monte Carlo sampling with iterative linear solvers to circumvent matrix inversions.
Our theoretical analyses reveal that unrolling and backpropagation through the iterations of the solver can accelerate gradient estimation for maximum likelihood estimation.
In experiments on simulated and real data, we demonstrate that probabilistic unrolling learns latent Gaussian models up to an order of magnitude faster than gradient EM, with minimal losses in model performance.
arXiv Detail & Related papers (2023-06-05T21:08:34Z)
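A schematic sketch of the ingredients (not the paper's algorithm): a quantity like tr(A^-1), which appears in Gaussian marginal-likelihood gradients, estimated with Monte Carlo probe vectors and a conjugate-gradient solver in place of an explicit inverse:

```python
import numpy as np

def cg(A, b, iters=50, tol=1e-10):
    """Standard conjugate gradients for SPD A, solving A x = b."""
    x = np.zeros_like(b)
    r = b.copy(); p = r.copy(); rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trace_inv_estimate(A, n_probes=16, rng=None):
    """Hutchinson estimator: E[z^T A^{-1} z] = tr(A^{-1}) for Rademacher z."""
    rng = rng or np.random.default_rng()
    z = rng.choice([-1.0, 1.0], size=(n_probes, A.shape[0]))
    return np.mean([zi @ cg(A, zi) for zi in z])

rng = np.random.default_rng(2)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)                 # well-conditioned SPD test matrix
print(trace_inv_estimate(A, rng=rng), np.trace(np.linalg.inv(A)))  # rough agreement
```

The "unrolling" part of the method then backpropagates through the iterations of such a solver, which this sketch omits.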
- Matrix Completion via Non-Convex Relaxation and Adaptive Correlation Learning [90.8576971748142]
We develop a novel surrogate that can be optimized by closed-form solutions.
We exploit pairwise correlations for completion, and thus propose an adaptive correlation learning model.
arXiv Detail & Related papers (2022-03-04T08:50:50Z)
- Interpolation can hurt robust generalization even when there is no noise [76.3492338989419]
We show that avoiding interpolation through ridge regularization can significantly improve generalization even in the absence of noise.
We prove this phenomenon for the robust risk of both linear regression and classification and hence provide the first theoretical result on robust overfitting.
arXiv Detail & Related papers (2021-08-05T23:04:15Z)
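A toy setup (mine, not the paper's) for the quantity at stake: for a linear predictor the worst-case squared loss under l2 input perturbations of size eps has the closed form (|x.w - y| + eps*||w||)^2, so one can directly compare a min-norm interpolator against a ridge solution in a noiseless overparameterized problem:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, eps = 50, 200, 0.5
w_star = np.zeros(d)
w_star[:5] = 1.0                                  # sparse truth, zero label noise
X = rng.standard_normal((n, d))
y = X @ w_star

w_interp = np.linalg.pinv(X) @ y                  # min-norm interpolator
w_ridge = np.linalg.solve(X.T @ X + 1.0 * np.eye(d), X.T @ y)

Xt = rng.standard_normal((5000, d))
yt = Xt @ w_star

def robust_risk(w):
    # Worst case over any l2 perturbation of the input with norm <= eps.
    return np.mean((np.abs(Xt @ w - yt) + eps * np.linalg.norm(w)) ** 2)

print("interpolator:", robust_risk(w_interp))
print("ridge       :", robust_risk(w_ridge))
```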
- A Random Matrix Perspective on Random Tensors [40.89521598604993]
We study the spectra of random matrices arising from contractions of a given random tensor.
Our technique yields a hitherto unknown characterization of the local maximum of the ML problem.
Our approach is versatile and can be extended to other models, such as asymmetric, non-Gaussian and higher-order ones.
arXiv Detail & Related papers (2021-08-02T10:42:22Z)
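A small sketch of the object being studied (setup mine): contracting a symmetric Gaussian 3-tensor with a unit vector yields a symmetric random matrix whose spectrum random matrix tools can describe.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
G = rng.standard_normal((n, n, n))
# Symmetrize over all six index permutations.
T = (G + G.transpose(1, 0, 2) + G.transpose(0, 2, 1) + G.transpose(2, 1, 0)
       + G.transpose(1, 2, 0) + G.transpose(2, 0, 1)) / 6
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
M = np.einsum('ijk,k->ij', T, v) / np.sqrt(n)   # contraction along the third mode
eigs = np.linalg.eigvalsh(M)                    # M is symmetric by construction
print(eigs.min(), eigs.max())                   # edges of the spectral bulk
```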
- Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional, previously unexplored benefit: it can simultaneously reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)
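A generic sketch of the mechanism (illustration mine; the GP-specific preconditioners the paper studies are more structured): a diagonal Jacobi preconditioner can sharply cut conjugate-gradient iterations on an SPD system whose poor conditioning comes from heterogeneous scales.

```python
import numpy as np

def pcg_iters(A, b, Minv_diag, tol=1e-8, max_iters=1000):
    """Preconditioned CG; returns the iteration count to reach tolerance."""
    x = np.zeros_like(b)
    r = b.copy(); z = Minv_diag * r; p = z.copy(); rz = r @ z
    for k in range(1, max_iters + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return k
        z = Minv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return max_iters

rng = np.random.default_rng(5)
n = 300
B = rng.standard_normal((n, n)) / np.sqrt(n)
A = np.diag(np.logspace(0, 4, n)) + B @ B.T     # SPD, ill-conditioned diagonal
b = rng.standard_normal(n)
print("plain CG  :", pcg_iters(A, b, np.ones(n)))
print("Jacobi PCG:", pcg_iters(A, b, 1.0 / np.diag(A)))
```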
- The Slow Deterioration of the Generalization Error of the Random Feature Model [12.865834066050427]
We show, theoretically and experimentally, that there is a dynamic self-correction mechanism at work.
This gives us ample time to stop the training process and obtain solutions with good generalization properties.
arXiv Detail & Related papers (2020-08-13T00:35:49Z)
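A toy version of the setting (mine, not the paper's experiment): a random feature model fit by gradient descent on noisy labels while the test error is monitored, so training can be stopped inside the window where generalization is still good.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_test, d, m = 100, 1000, 5, 500            # overparameterized: m >> n

def target(X):
    return np.sin(X @ np.ones(d))

X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y = target(X) + 0.3 * rng.standard_normal(n)   # noisy training labels
yt = target(Xt)                                # clean test targets

W = rng.standard_normal((d, m))                # frozen random features
Phi, Phit = np.tanh(X @ W), np.tanh(Xt @ W)
a = np.zeros(m)                                # trained output weights
for step in range(1, 20001):
    a -= 1e-3 * Phi.T @ (Phi @ a - y) / n      # full-batch gradient descent
    if step % 4000 == 0:
        print(step, np.mean((Phit @ a - yt) ** 2))  # watch for deterioration
```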
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in finite time, up to an error.
Rather than fixing the minibatch and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
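A schematic sketch of the trade-off (the doubling rule is mine, not the paper's): keep a constant step-size for fast early progress and grow the minibatch when the measured gradient gets small, so variance shrinks instead of the step-size.

```python
import numpy as np

rng = np.random.default_rng(7)
d, N = 20, 10000
w_star = rng.standard_normal(d)
X = rng.standard_normal((N, d))
y = X @ w_star + 0.1 * rng.standard_normal(N)

w, lr, batch = np.zeros(d), 0.05, 8
for step in range(500):
    idx = rng.choice(N, size=batch, replace=False)
    g = X[idx].T @ (X[idx] @ w - y[idx]) / batch   # minibatch gradient
    w -= lr * g                                    # constant step-size
    if np.linalg.norm(g) < 1.0 and batch < 1024:   # stalling => cut variance
        batch *= 2
    if step % 100 == 0:
        print(step, batch, np.linalg.norm(w - w_star))
```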
- Optimal Iterative Sketching with the Subsampled Randomized Hadamard Transform [64.90148466525754]
We study the performance of iterative sketching for least-squares problems.
We show that the convergence rates for Haar and randomized Hadamard matrices are identical, and asymptotically improve upon random projections.
These techniques may be applied to other algorithms that employ randomized dimension reduction.
arXiv Detail & Related papers (2020-02-03T16:17:50Z)
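A compact sketch of iterative sketching for least squares, min_x ||Ax - b||; for brevity a Gaussian sketch stands in for the paper's subsampled randomized Hadamard transform (SRHT).

```python
import numpy as np

def iterative_sketch_lstsq(A, b, m=200, iters=8, rng=None):
    """Iterative Hessian sketch: exact gradients, sketched Newton steps."""
    rng = rng or np.random.default_rng()
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        S = rng.standard_normal((m, n)) / np.sqrt(m)  # fresh sketch each pass
        SA = S @ A                                    # m x d, with m << n
        r = A.T @ (A @ x - b)                         # exact gradient
        # Newton-type step with the sketched Hessian (SA)^T (SA) ~ A^T A.
        x -= np.linalg.solve(SA.T @ SA, r)
    return x

rng = np.random.default_rng(8)
A = rng.standard_normal((5000, 50))
b = rng.standard_normal(5000)
x = iterative_sketch_lstsq(A, b, rng=rng)
print(np.linalg.norm(x - np.linalg.lstsq(A, b, rcond=None)[0]))  # near zero
```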