Related papers: OLion: Approaching the Hadamard Ideal by Intersecting Spectral and $\ell

OLion: Approaching the Hadamard Ideal by Intersecting Spectral and $\ell_{\infty}$ Implicit Biases

URL: http://arxiv.org/abs/2602.01105v2
Date: Mon, 09 Feb 2026 09:21:47 GMT
Title: OLion: Approaching the Hadamard Ideal by Intersecting Spectral and $\ell_{\infty}$ Implicit Biases
Authors: Zixiao Wang, Yifei Shen, Huishuai Zhang,
Abstract summary: nameA combines spectral control from update directions with coordinate control from sign updates.<n>We prove convergence under a mild, empirically verified diagonal-isotropy assumption.<n>nameA matches or outperforms AdamW and Muon under comparable tuning while using only momentum-level state.
Score: 29.60546958677364
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many optimizers can be interpreted as steepest-descent methods under norm-induced geometries, and thus inherit corresponding implicit biases. We introduce \nameA{} (\fullname{}), which combines spectral control from orthogonalized update directions with $\ell_\infty$-style coordinate control from sign updates. \nameA{} forms a Lion-style momentum direction, approximately orthogonalizes it via a few Newton--Schulz iterations, and then applies an entrywise sign, providing an efficient approximation to taking a maximal step over the intersection of the spectral and $\ell_\infty$ constraint sets (a scaled Hadamard-like set for matrix parameters). Despite the strong nonlinearity of orthogonalization and sign, we prove convergence under a mild, empirically verified diagonal-isotropy assumption. Across large-scale language and vision training, including GPT-2 and Llama pretraining, SiT image pretraining, and supervised fine-tuning, \nameA{} matches or outperforms AdamW and Muon under comparable tuning while using only momentum-level optimizer state, and it mitigates optimizer mismatch when fine-tuning AdamW-pretrained checkpoints.

Related papers

Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences.<n>We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices.<n>We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
arXiv Detail & Related papers (2026-02-26T15:27:53Z)
Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization [14.136955342888987]
We reform the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term.<n>This produces $textbfDeVA$ ($textbfV$ariance $textbfA$daptation), a framework that bridges between vector-based variance adaptation and matrix spectral optimization.
arXiv Detail & Related papers (2026-02-06T17:06:42Z)
RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization [1.3537117504260623]
Momentum methods, such as Polyak's Heavy Ball, are the standard for training deep networks but suffer from curvature-induced bias in settings.<n>We propose textbfRanSOM, a unified framework that eliminates this bias by replacing deterministic step sizes with randomized steps drawn from distributions with mean $_t$.<n>We instantiate this framework in two algorithms: textbfRanSOM-E for unconstrained optimization and textbfRanSOM-B for constrained optimization.
arXiv Detail & Related papers (2026-02-06T16:09:36Z)
Scaling Bidirectional Spans and Span Violations in Attention Mechanism [5.755498052202004]
canonical $O(N2)$ Transformer remains the empirical performance frontier in sequence modeling.<n>We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans.<n>We demonstrate that selectively scaling these components, focusing primarily on $0th$ order bidirectional parallel spans, yields the most effective learning signal.
arXiv Detail & Related papers (2025-12-15T07:03:24Z)
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning [50.11170157029911]
In modern scale-invariant architectures, training quickly enters an degrading-governed steady state.<n>We introduce a weight-decay scaling rule for AdamW that preserves sublayer gain across widths.<n>Our results extend $mu$P beyond the near-init regime by explicitly controlling the steady-state scales set by parameters.
arXiv Detail & Related papers (2025-10-17T02:58:35Z)
AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates [5.049533819651459]
We propose a new adaptive update, AdaGO, which combines a norm-based update with aGrad-type step.<n>AdaGO preserves the orthogonality of the update, which can be interpreted as a spectral descent, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradients.
arXiv Detail & Related papers (2025-09-03T03:42:22Z)
Simple Convergence Proof of Adam From a Sign-like Descent Perspective [58.89890024903816]
We show that Adam achieves the optimal rate of $cal O(frac1Ts14)$ rather than the previous $cal O(fracln TTs14)$.<n>Our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence.
arXiv Detail & Related papers (2025-07-08T13:19:26Z)
Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness [51.302674884611335]
This work introduces a hybrid non-Euclidean optimization method which generalizes norm clipping by combining steepest descent and conditional gradient approaches.<n>We discuss how to instantiate the algorithms for deep learning and demonstrate their properties on image classification and language modeling.
arXiv Detail & Related papers (2025-06-02T17:34:29Z)
Implicit Bias and Fast Convergence Rates for Self-attention [26.766649949420746]
We study the fundamental optimization principles of self-attention, the defining mechanism of transformers.<n>We analyze the implicit bias of gradient-baseds in a self-attention layer with a decoder in a linear classification.
arXiv Detail & Related papers (2024-02-08T15:15:09Z)
Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem. We characterize the implicit bias of 1-layer transformers optimized with gradient descent. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $min_mathbfxmax_mathbfyF(mathbfx) + H(mathbfx,mathbfy)$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$. We present a emphaccelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragrad
arXiv Detail & Related papers (2022-06-17T06:10:20Z)
Optimal and instance-dependent guarantees for Markovian linear stochastic approximation [47.912511426974376]
We show a non-asymptotic bound of the order $t_mathrmmix tfracdn$ on the squared error of the last iterate of a standard scheme. We derive corollaries of these results for policy evaluation with Markov noise.
arXiv Detail & Related papers (2021-12-23T18:47:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.