AuON: A Linear-time Alternative to Semi-Orthogonal Momentum Updates
- URL: http://arxiv.org/abs/2509.24320v2
- Date: Tue, 30 Sep 2025 06:34:27 GMT
- Title: AuON: A Linear-time Alternative to Semi-Orthogonal Momentum Updates
- Authors: Dipan Maity
- Abstract summary: We study the semi-orthogonal properties of momentum-based updates and develop a method to bound momentum updates under a spectral-norm trust region. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without constructing semi-orthogonal matrices. Our approach combines hyperbolic-cosine RMS scaling transformations with normalization, demonstrating both effectiveness and computational efficiency compared to Newton-Schulz methods.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Orthogonal gradient updates have emerged as a promising direction in optimization for machine learning. However, traditional approaches such as SVD/QR decomposition incur prohibitive computational costs of O(n^3) and underperform compared to well-tuned SGD with momentum, since momentum is applied only after strict orthogonalization. Recent advances, such as Muon, improve efficiency by applying momentum before orthogonalization and producing semi-orthogonal matrices via Newton-Schulz iterations, reducing complexity to O(n^2). Nevertheless, quadratic costs remain a bottleneck. In this work, we study the semi-orthogonal properties of momentum-based updates and develop a method to bound momentum updates under a spectral-norm trust region, preserving directional information without requiring explicit semi-orthogonalization. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without constructing semi-orthogonal matrices, while preserving structural alignment and reconditioning ill-posed updates. Our approach combines hyperbolic-cosine RMS scaling transformations with normalization, demonstrating both effectiveness and computational efficiency compared to Newton-Schulz methods. We further introduce a hybrid variant (Hybrid-AuON) that applies a single Newton-Schulz iteration. Experiments across vision and language benchmarks show that AuON and its hybrid variant achieve performance comparable to strong baselines such as AdamW and Muon. Code is available at: https://github.com/ryyzn9/AuON
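The abstract names the ingredients (cosh-based RMS scaling plus normalization) but not the exact transform, so the following is a minimal sketch of what such an update could look like; the functional form and constants below are assumptions, not the authors' released implementation (see the linked repository for that).

```python
import numpy as np

def auon_like_update(momentum: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical AuON-style transform of a momentum matrix (sketch only)."""
    # RMS of the raw momentum update; every step here is linear in the
    # number of entries, with no matrix-matrix products.
    rms = np.sqrt(np.mean(momentum ** 2)) + eps
    # Elementwise cosh-based scaling: boosts small entries relative to the
    # RMS while preserving each coordinate's sign (assumed form).
    scaled = momentum * np.cosh(momentum / rms)
    # Renormalize to unit RMS, bounding the update's magnitude without any
    # semi-orthogonalization step.
    return scaled / (np.sqrt(np.mean(scaled ** 2)) + eps)

rng = np.random.default_rng(0)
M = rng.normal(size=(256, 128))
print(np.sqrt(np.mean(auon_like_update(M) ** 2)))  # ~1.0 by construction
```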
Related papers
- Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum [5.049533819651459]
We propose a new optimizer, NAMO, and a diagonal extension, NAMO-D, to provide the first principled integration of momentum with noise adaptation. NAMO-D right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to AdamW and Muon.
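A hedged sketch of the diagonal right-multiplication the summary describes; the orthogonalization is done with an exact SVD for clarity, and deriving the clamped diagonal from a per-column second moment is our assumption:

```python
import numpy as np

def namo_d_direction(momentum, col_second_moment, clip=(0.1, 10.0)):
    """Orthogonalized momentum, right-multiplied by a clamped diagonal.

    col_second_moment: per-column noise statistics, length momentum.shape[1]
    (how the diagonal is derived is an assumption in this sketch).
    """
    # Orthogonalize the momentum (Newton-Schulz would be cheaper; SVD is exact).
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    ortho = u @ vt
    # Clamped diagonal entries, as the abstract describes.
    d = np.clip(1.0 / (np.sqrt(col_second_moment) + 1e-8), *clip)
    return ortho @ np.diag(d)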
arXiv Detail & Related papers (2026-02-19T05:00:39Z) - Delving into Muon and Beyond: Deep Analysis and Extensions [8.297062899157664]
We study Muon as the $p = 0$ endpoint of a family of spectral transformations of the form $U \boldsymbol{\Sigma}^p V^\top$. We find that RMS-normalized updates yield more stable optimization than first-moment updates. Our results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method.
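The family is easy to state concretely: for a momentum matrix with SVD $M = U \Sigma V^\top$, the update is $U \Sigma^p V^\top$, so $p = 1$ returns the raw momentum and $p = 0$ recovers the Muon-style orthogonalized update. A small sketch:

```python
import numpy as np

def spectral_power_update(m: np.ndarray, p: float) -> np.ndarray:
    """U diag(sigma**p) V^T: p=1 is the identity map, p=0 sets all
    singular values to 1 (Muon-style orthogonalization)."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return u @ np.diag(s ** p) @ vt

rng = np.random.default_rng(0)
m = rng.normal(size=(64, 32))
assert np.allclose(spectral_power_update(m, 1.0), m, atol=1e-6)
ortho = spectral_power_update(m, 0.0)
assert np.allclose(ortho.T @ ortho, np.eye(32), atol=1e-6)
```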
arXiv Detail & Related papers (2026-02-04T15:40:47Z) - Merging Beyond: Streaming LLM Updates via Activation-Guided Rotations [55.047454145941366]
Streaming Merging is an innovative model updating paradigm that conceptualizes merging as an iterative optimization process. ARM is a strategy designed to approximate gradient-descent dynamics. ARM requires only early SFT checkpoints and, through iterative merging, surpasses the fully converged SFT model.
arXiv Detail & Related papers (2026-02-03T08:15:57Z) - OLion: Approaching the Hadamard Ideal by Intersecting Spectral and $\ell_{\infty}$ Implicit Biases [29.60546958677364]
OLion combines spectral control from update directions with coordinate control from sign updates. We prove convergence under a mild, empirically verified diagonal-isotropy assumption. OLion matches or outperforms AdamW and Muon under comparable tuning while using only momentum-level state.
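The summary names both implicit biases but not how they are intersected; a purely illustrative blend of a spectral (orthogonalized) direction with a Lion-style sign direction, as an assumption about the flavor of the method:

```python
import numpy as np

def olion_like_update(momentum: np.ndarray, mix: float = 0.5) -> np.ndarray:
    """Hypothetical interpolation between a spectrally controlled direction
    and an ell-infinity controlled sign direction (illustrative only)."""
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    spectral_dir = u @ vt          # bounded spectral norm
    sign_dir = np.sign(momentum)   # bounded ell-infinity norm
    return mix * spectral_dir + (1.0 - mix) * sign_dir
```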
arXiv Detail & Related papers (2026-02-01T08:59:45Z) - UNSO: Unified Newton Schulz Orthogonalization [10.113387496783007]
The Newton-Schulz (NS) iteration has gained increasing interest for its role in the Muon optimizer and in optimization on the Stiefel manifold. We consolidate the iterative structure into a unified framework, named Unified Newton-Schulz Orthogonalization (UNSO). The iteration coefficients are made learnable and optimized, achieving strong performance with stable convergence.
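For reference, the classical cubic Newton-Schulz iteration that UNSO generalizes fixes the polynomial coefficients at 1.5 and -0.5; UNSO instead learns them. A minimal sketch of the fixed-coefficient baseline:

```python
import numpy as np

def newton_schulz_orthogonalize(m: np.ndarray, steps: int = 5) -> np.ndarray:
    """Cubic Newton-Schulz iteration driving singular values toward 1.
    Converges for singular values in (0, sqrt(3)); Frobenius normalization
    puts them in (0, 1]. Very small singular values need more steps."""
    x = m / (np.linalg.norm(m) + 1e-8)  # Frobenius-norm normalization
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```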
arXiv Detail & Related papers (2026-01-18T17:54:43Z) - Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback [50.89125374999765]
We provide the first convergence guarantee for Optimistic Multiplicative Weights Update ($\mathtt{OMWU}$) in NLHF. Our analysis identifies a novel marginal convergence behavior, where the probability of rarely played actions grows exponentially from exponentially small values.
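The OMWU update itself is standard: a multiplicative-weights step on the simplex whose payoff signal is replaced by the optimistic prediction 2 x current - previous. A minimal single-step sketch:

```python
import numpy as np

def omwu_step(x, payoff_now, payoff_prev, eta=0.1):
    """One OMWU step: multiplicative weights with the optimistic
    payoff prediction 2 * current - previous."""
    logits = np.log(x + 1e-30) + eta * (2.0 * payoff_now - payoff_prev)
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()                 # project back onto the simplex
```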
arXiv Detail & Related papers (2025-12-31T12:08:29Z) - AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates [5.049533819651459]
We propose a new adaptive update, AdaGO, which combines a norm-based update with an AdaGrad-type stepsize. AdaGO preserves the orthogonality of the update, which can be interpreted as spectral descent, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradients.
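A hedged sketch of that combination: an orthogonalized (spectral-descent) direction whose stepsize decays AdaGrad-style with accumulated gradient norms. The scalar accumulator and its placement are our assumptions:

```python
import numpy as np

class AdaGOSketch:
    """AdaGrad-type stepsize applied to an orthogonalized direction
    (illustrative reading of the abstract, not the paper's exact rule)."""
    def __init__(self, lr=0.1, eps=1e-8):
        self.lr, self.eps, self.accum = lr, eps, 0.0

    def step(self, w, grad):
        self.accum += np.linalg.norm(grad) ** 2      # accumulated past gradients
        u, _, vt = np.linalg.svd(grad, full_matrices=False)
        return w - self.lr / (np.sqrt(self.accum) + self.eps) * (u @ vt)
```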
arXiv Detail & Related papers (2025-09-03T03:42:22Z) - Inertial Quadratic Majorization Minimization with Application to Kernel Regularized Learning [1.0282274843007797]
We introduce the Quadratic Majorization Minimization with Extrapolation (QMME) framework and establish its sequential convergence properties. To demonstrate practical advantages, we apply QMME to large-scale kernel regularized learning problems.
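A generic sketch of quadratic majorization minimization with inertial extrapolation (not the paper's kernel-learning specialization): majorize the objective at an extrapolated point by a quadratic upper bound, whose minimizer reduces to a gradient step from that point:

```python
import numpy as np

def qmme_minimize(grad, lipschitz, x0, steps=100, beta=0.5):
    """Minimize a smooth f via quadratic majorization at the extrapolated
    point y = x + beta*(x - x_prev); the surrogate minimizer is a gradient
    step y - grad(y)/L. Generic illustration only."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(steps):
        y = x + beta * (x - x_prev)             # inertial extrapolation
        x_prev, x = x, y - grad(y) / lipschitz  # minimize the majorizer
    return x

# Example: minimize ||x||^2 / 2 (gradient is x, Lipschitz constant 1).
print(qmme_minimize(lambda x: x, 1.0, np.ones(3)))
```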
arXiv Detail & Related papers (2025-07-06T05:17:28Z) - Rao-Blackwell Gradient Estimators for Equivariant Denoising Diffusion [55.95767828747407]
In domains such as molecular and protein generation, physical systems exhibit inherent symmetries that are critical to model. We present a framework that reduces training variance and provides a provably lower-variance gradient estimator. We also present a practical implementation of this estimator incorporating the loss and sampling procedure through a method we call Orbit Diffusion.
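The core variance-reduction idea can be shown with a finite symmetry group: Rao-Blackwellizing by averaging the loss over an input's orbit cannot increase the gradient estimator's variance. A toy stand-in (the paper works with continuous symmetries in diffusion training):

```python
import numpy as np

def orbit_averaged_loss(loss_fn, x, group_elems):
    """Average the loss over the orbit {g @ x}: a Rao-Blackwellized,
    provably no-higher-variance version of loss_fn(x)."""
    return np.mean([loss_fn(g @ x) for g in group_elems])

# Example: average over the four planar rotations by multiples of 90 degrees.
rots = [np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
        for t in (0.0, np.pi / 2, np.pi, 3 * np.pi / 2)]
x = np.array([1.0, 2.0])
print(orbit_averaged_loss(lambda v: float(v[0] ** 2), x, rots))  # 2.5
```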
arXiv Detail & Related papers (2025-02-14T03:26:57Z) - Improving Adaptive Moment Optimization via Preconditioner Diagonalization [11.01832755213396]
We show that our approach can substantially enhance the convergence speed of modern adaptive optimizers. For large language models like LLaMA, we can achieve a speedup of 2x compared to the baseline Adam.
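A hedged sketch of what preconditioner diagonalization can mean here: rotate gradients into the eigenbasis of an accumulated second-moment matrix, where a diagonal preconditioner fits well, scale there, and rotate back. The statistics and refresh schedule below are assumptions:

```python
import numpy as np

def diagonalized_precondition(past_grads, grad, eps=1e-8):
    """Apply a diagonal preconditioner in the eigenbasis that diagonalizes
    an accumulated gradient second-moment matrix (illustrative only)."""
    cov = sum(np.outer(g, g) for g in past_grads) / len(past_grads)
    w, q = np.linalg.eigh(cov)           # orthogonal diagonalizing basis
    g_rot = q.T @ grad                   # gradient in that basis
    scale = np.sqrt(np.maximum(w, 0.0)) + eps
    return q @ (g_rot / scale)           # scale per-coordinate, rotate back
```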
arXiv Detail & Related papers (2025-02-11T11:48:04Z) - Momentum Stiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport [18.717832661972896]
A new approach is proposed based on, for the first time, an interplay between thoughtfully designed continuous and discrete dynamics.
The method exactly preserves the manifold structure but does not require the commonly used projection or retraction.
Its generalization to adaptive learning rates is also demonstrated.
arXiv Detail & Related papers (2022-05-27T18:01:45Z) - Matrix Completion via Non-Convex Relaxation and Adaptive Correlation Learning [90.8576971748142]
We develop a novel surrogate that can be optimized by closed-form solutions.
We exploit correlations for completion, and thus propose an adaptive correlation learning model.
arXiv Detail & Related papers (2022-03-04T08:50:50Z) - DASHA: Distributed Nonconvex Optimization with Communication Compression, Optimal Oracle Complexity, and No Client Synchronization [77.34726150561087]
We develop and analyze DASHA: a new family of methods for nonconvex distributed optimization problems.
Unlike MARINA, the new methods DASHA and DASHA-MVR send compressed vectors only and never synchronize the nodes, which makes them more practical for federated learning.
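The communication pattern hinges on an unbiased compression operator; a standard rand-k example of the kind such methods transmit in place of full vectors:

```python
import numpy as np

def rand_k_compress(v: np.ndarray, k: int, rng) -> np.ndarray:
    """Unbiased rand-k compressor: keep k random coordinates and rescale
    by d/k so that E[compressed] = v."""
    d = v.size
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

rng = np.random.default_rng(0)
v = rng.normal(size=1000)
est = np.mean([rand_k_compress(v, 100, rng) for _ in range(2000)], axis=0)
print(np.linalg.norm(est - v) / np.linalg.norm(v))  # shrinks with more samples
```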
arXiv Detail & Related papers (2022-02-02T20:10:40Z) - Learning Augmentation Distributions using Transformed Risk Minimization [47.236227685707526]
We propose a new Transformed Risk Minimization (TRM) framework as an extension of classical risk minimization.
As a key application, we focus on learning augmentations to improve classification performance with a given class of predictors.
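A generic reading of the TRM objective: classical risk, but with the loss averaged over a learned distribution over transformations (the paper's exact parametrization may differ):

```python
def transformed_risk(loss_fn, predictor, x, y, transforms, probs):
    """TRM-style objective on one example: the expected loss under a
    learned distribution (probs) over input transformations."""
    return sum(p * loss_fn(predictor(t(x)), y)
               for t, p in zip(transforms, probs))
```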
arXiv Detail & Related papers (2021-11-16T02:07:20Z) - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization [89.7882166459412]
Stochastic gradient noise (SGN) acts as implicit regularization for deep learning.
Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning.
For simulating SGN at low computational costs and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
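A hedged sketch of the positive-negative combination: two momentum buffers updated on alternating steps are mixed with weights (1 + beta0) and -beta0, which amplifies the gradient-noise component at no extra gradient cost. Coefficient placement follows our reading of the idea, not the paper's exact normalization:

```python
import numpy as np

class PNMSketch:
    """Positive-Negative Momentum, illustrative version: alternate between
    two momentum buffers and combine them with a positive and a negative
    weight before the parameter update."""
    def __init__(self, lr=0.01, beta=0.9, beta0=1.0):
        self.lr, self.beta, self.beta0 = lr, beta, beta0
        self.buffers = [None, None]  # even-step and odd-step momentum
        self.t = 0

    def step(self, w, grad):
        i = self.t % 2
        m = self.buffers[i]
        self.buffers[i] = grad if m is None else self.beta * m + (1 - self.beta) * grad
        other = self.buffers[1 - i]
        neg = other if other is not None else self.buffers[i]
        update = (1 + self.beta0) * self.buffers[i] - self.beta0 * neg
        self.t += 1
        return w - self.lr * update
```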
arXiv Detail & Related papers (2021-03-31T16:08:06Z) - Controllable Orthogonalization in Training DNNs [96.1365404059924]
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1.
This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI).
We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction.
We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization.
arXiv Detail & Related papers (2020-04-02T10:14:27Z) - Multi-Objective Matrix Normalization for Fine-grained Visual Recognition [153.49014114484424]
Bilinear pooling achieves great success in fine-grained visual recognition (FGVC).
Recent methods have shown that the matrix power normalization can stabilize the second-order information in bilinear features.
We propose an efficient Multi-Objective Matrix Normalization (MOMN) method that can simultaneously normalize a bilinear representation with respect to multiple objectives.
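The matrix power normalization ingredient is standard and easy to state; a sketch for a symmetric PSD bilinear feature (MOMN's contribution, handling several normalization objectives jointly, is not shown here):

```python
import numpy as np

def matrix_power_normalize(b: np.ndarray, p: float = 0.5, eps: float = 1e-8):
    """Matrix power normalization of a symmetric PSD bilinear feature:
    raise its eigenvalues to p (p = 0.5 is the common square-root variant
    that stabilizes second-order statistics)."""
    w, q = np.linalg.eigh(b)
    return q @ np.diag(np.maximum(w, eps) ** p) @ q.T
```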
arXiv Detail & Related papers (2020-03-30T08:40:35Z)