Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization
- URL: http://arxiv.org/abs/2602.06880v1
- Date: Fri, 06 Feb 2026 17:06:42 GMT
- Title: Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization
- Authors: Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich
- Abstract summary: We reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term. This produces $\textbf{DeVA}$ ($\textbf{De}$coupled $\textbf{V}$ariance $\textbf{A}$daptation), a framework that bridges vector-based variance adaptation and matrix spectral optimization.
- Score: 14.136955342888987
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Adaptive methods like Adam have become the $\textit{de facto}$ standard for large-scale vector and Euclidean optimization due to their coordinate-wise adaptation with a second-order nature. More recently, matrix-based spectral optimizers like Muon (Jordan et al., 2024b) show the power of treating weight matrices as matrices rather than long vectors. Linking these is hard because many natural generalizations are not feasible to implement, and we also cannot simply move the Adam adaptation to the matrix spectrum. To address this, we reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term. This decoupling produces $\textbf{DeVA}$ ($\textbf{De}$coupled $\textbf{V}$ariance $\textbf{A}$daptation), a framework that bridges between vector-based variance adaptation and matrix spectral optimization, enabling a seamless transition from Adam to adaptive spectral descent. Extensive experiments across language modeling and image classification demonstrate that DeVA consistently outperforms state-of-the-art methods such as Muon and SOAP (Vyas et al., 2024), reducing token usage by around 6.6\%. Theoretically, we show that the variance adaptation term effectively improves the blockwise smoothness, facilitating faster convergence. Our implementation is available at https://github.com/Tsedao/Decoupled-Variance-Adaptation
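The abstract's central move is to split the AdaGrad update into a variance-adaptation term and a scale-invariant term. The following is a minimal sketch of that decoupling idea in the vector case; the function name and the specific factorization (direction norm as the adaptive scale, normalized preconditioned gradient as the scale-invariant direction) are illustrative assumptions, not the paper's actual DeVA algorithm.

```python
import math

def adagrad_update_decoupled(grad, accum, lr=1e-2, eps=1e-8):
    # Illustrative sketch (not the paper's API): factor the AdaGrad step into
    # a variance-adaptation scalar times a scale-invariant direction.
    accum = [a + g * g for a, g in zip(accum, grad)]              # accumulate squared gradients
    precond = [g / (math.sqrt(a) + eps) for g, a in zip(grad, accum)]  # standard AdaGrad step
    norm = math.sqrt(sum(p * p for p in precond))
    direction = [p / (norm + eps) for p in precond]   # scale-invariant term (unit norm)
    scale = norm                                      # variance-adaptation term
    step = [lr * scale * d for d in direction]        # recombines to the same update
    return step, accum
```

Multiplying the two factors back together recovers the ordinary AdaGrad step, which is what makes the decomposition a re-reading of the update rather than a new one; the framework then swaps in different choices for each factor.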
Related papers
- Adaptive Matrix Online Learning through Smoothing with Guarantees for Nonsmooth Nonconvex Optimization [54.723834588133165]
We study online linear optimization with matrix variables (AML), a setting where the geometry makes designing data-dependent and efficient adaptive algorithms challenging. We instantiate this framework with two efficient methods that avoid projections. We show both methods admit closed-form updates and match one-sided Shampoo's regret up to a constant factor, while significantly reducing computational cost.
arXiv Detail & Related papers (2026-02-09T03:09:47Z) - OLion: Approaching the Hadamard Ideal by Intersecting Spectral and $\ell_{\infty}$ Implicit Biases [29.60546958677364]
OLion combines spectral control from update directions with coordinate control from sign updates. We prove convergence under a mild, empirically verified diagonal-isotropy assumption. OLion matches or outperforms AdamW and Muon under comparable tuning while using only momentum-level state.
arXiv Detail & Related papers (2026-02-01T08:59:45Z) - Multivariate Variational Autoencoder [0.08460698440162889]
We present a VAE variant that preserves Gaussian tractability while lifting the diagonal posterior restriction. MVAE factorizes each posterior covariance, where a global coupling matrix $\mathbf{C}$ induces dataset-wide latent correlations. We release a fully reproducible implementation with training/evaluation scripts and sweep utilities to facilitate fair comparison and reuse.
arXiv Detail & Related papers (2025-11-08T16:52:53Z) - ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms [21.010238822100135]
Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles.
arXiv Detail & Related papers (2025-09-11T17:59:51Z) - AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates [5.049533819651459]
We propose a new adaptive update, AdaGO, which combines a norm-based update with an AdaGrad-type stepsize. AdaGO preserves the orthogonality of the update, which can be interpreted as spectral descent, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradients.
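The combination the AdaGO summary describes can be sketched in the vector case, where a unit-norm direction plays the role of the orthogonalized (spectral-descent) update. This is a hedged illustration of the general idea, not AdaGO's actual algorithm; the function name and the scalar accumulator are assumptions.

```python
import math

def norm_based_adaptive_step(grad, accum_sq, lr=0.1, eps=1e-8):
    # Illustrative sketch: keep the update direction at fixed (unit) norm --
    # the vector analogue of an orthogonalized update -- and set the stepsize
    # AdaGrad-style from accumulated past gradient norms.
    gnorm = math.sqrt(sum(g * g for g in grad))
    accum_sq += gnorm * gnorm                        # AdaGrad-type accumulator
    direction = [g / (gnorm + eps) for g in grad]    # norm-preserving direction
    stepsize = lr / (math.sqrt(accum_sq) + eps)      # adaptive stepsize
    return [-stepsize * d for d in direction], accum_sq
```

Because the direction is normalized before scaling, adaptivity enters only through the stepsize, mirroring the summary's separation of an orthogonality-preserving update from landscape-dependent step sizing.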
arXiv Detail & Related papers (2025-09-03T03:42:22Z) - Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models [43.79587815909472]
We propose AdaGram, an efficient full-matrix adaptive gradient method. AdaGram converges faster than or matches the performance of diagonal adaptive methods when using rank-five and smaller rank approximations. This demonstrates AdaGram's potential as a scalable solution for adaptive optimization in large models.
arXiv Detail & Related papers (2025-08-28T13:15:05Z) - FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA [68.44043212834204]
Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in federated learning (FL).
arXiv Detail & Related papers (2025-05-19T07:32:56Z) - Improving Adaptive Moment Optimization via Preconditioner Diagonalization [11.01832755213396]
We show that our approach can substantially enhance the convergence speed of modern adaptive optimizers. For large language models like LLaMA, we can achieve a speedup of 2x compared to the baseline Adam.
arXiv Detail & Related papers (2025-02-11T11:48:04Z) - ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Hybrid Model-based / Data-driven Graph Transform for Image Coding [54.31406300524195]
We present a hybrid model-based / data-driven approach to encode an intra-prediction residual block.
The first $K$ eigenvectors of a transform matrix are derived from a statistical model, e.g., the asymmetric discrete sine transform (ADST) for stability.
Using WebP as a baseline image codec, experimental results show that our hybrid graph transform achieves better energy compaction than the default discrete cosine transform (DCT) and better stability than the KLT.
arXiv Detail & Related papers (2022-03-02T15:36:44Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
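The MaxVA summary above describes replacing Adam's fixed-weight running mean of squared gradients with a weighted mean whose weights maximize the estimated per-coordinate variance. A minimal per-coordinate sketch of that selection idea, assuming a small discrete set of candidate EMA weights (the function name and candidate grid are illustrative, not the paper's method):

```python
def variance_maximizing_moments(m1, m2, grad, betas=(0.5, 0.9, 0.999)):
    # Illustrative sketch: instead of a fixed EMA weight, pick per step the
    # candidate weight that maximizes the estimated variance v - m^2
    # of this coordinate's gradient.
    best_var, best = None, None
    for beta in betas:
        new_m1 = beta * m1 + (1 - beta) * grad            # first moment
        new_m2 = beta * m2 + (1 - beta) * grad * grad     # second moment
        var = new_m2 - new_m1 * new_m1                    # estimated variance
        if best_var is None or var > best_var:
            best_var, best = var, (new_m1, new_m2)
    return best
```

Larger estimated variance makes the averaging react faster to a shift in the gradient distribution, which is the "faster adaptation" the summary refers to.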
This list is automatically generated from the titles and abstracts of the papers on this site.