The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
- URL: http://arxiv.org/abs/2602.16340v1
- Date: Wed, 18 Feb 2026 10:25:07 GMT
- Title: The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
- Authors: Eitan Gronich, Gal Vardi
- Abstract summary: We study the implicit bias of momentum-based optimizers on homogeneous models. We show that for smooth homogeneous models, momentum steepest descent algorithms are biased towards KKT points of the corresponding margin maximization problem.
- Score: 22.08387089416152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
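The optimizers named in the abstract differ only in the norm used to normalize the momentum buffer before each step. The NumPy sketch below illustrates the three normalized update rules on a single matrix parameter; the 1/sqrt(t) schedule, the hyperparameters, and the toy separable problem are illustrative assumptions rather than the paper's experimental setup, and practical Muon implementations typically replace the exact SVD with a Newton-Schulz iteration.

```python
import numpy as np

def spectral_direction(G):
    """Steepest-descent direction for the spectral norm (Muon-style update):
    orthogonalize the momentum buffer, here via an exact SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def l2_direction(G):
    """Normalized gradient: steepest descent for the Frobenius / l2 norm."""
    return G / (np.linalg.norm(G) + 1e-12)

def linf_direction(G):
    """Elementwise sign: steepest descent for the l_inf norm (Signum-style)."""
    return np.sign(G)

def momentum_steepest_descent(grad_fn, W, direction, steps=1000, lr0=0.1, beta=0.9):
    """Momentum steepest descent with a decaying learning-rate schedule;
    the 1/sqrt(t) decay is an illustrative choice, not the paper's schedule."""
    M = np.zeros_like(W)
    for t in range(1, steps + 1):
        M = beta * M + (1 - beta) * grad_fn(W)     # momentum buffer
        W = W - (lr0 / np.sqrt(t)) * direction(M)  # normalized steepest-descent step
    return W

# Toy usage: a linear (1-homogeneous) model on separable data with exponential loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
y = np.sign(X @ rng.normal(size=8))  # separable labels by construction
grad = lambda w: -((y * np.exp(-y * (X @ w.ravel()))) @ X / len(y)).reshape(w.shape)
W_inf = momentum_steepest_descent(grad, np.zeros((8, 1)), linf_direction)
```

Swapping `linf_direction` for `l2_direction` or `spectral_direction` changes which norm's margin the trajectory implicitly maximizes, which is the distinction the paper's experiments probe.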
Related papers
- Adaptive Optimization via Momentum on Variance-Normalized Gradients [21.17954226393917]
MVN-Grad improves stability and performance by combining two complementary ideas: variance-based normalization and momentum applied after normalization. Across CIFAR-100 image classification and GPT-style language modeling benchmarks, MVN-Grad matches or outperforms Adam, AdaBelief, and LaProp.
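Read literally, that ordering (normalize first, then accumulate momentum) is the reverse of Adam's. A rough one-step sketch of the described recipe, with assumed names and constants that are not the paper's definitions:

```python
import numpy as np

def mvn_grad_step(w, g, state, lr=1e-3, beta_var=0.999, beta_mom=0.9, eps=1e-8):
    """Scale the raw gradient by a running second-moment (variance-style)
    estimate, then apply momentum to the already-normalized gradient."""
    state["v"] = beta_var * state["v"] + (1 - beta_var) * g ** 2  # running second moment
    g_hat = g / (np.sqrt(state["v"]) + eps)                       # variance-based normalization
    state["m"] = beta_mom * state["m"] + (1 - beta_mom) * g_hat   # momentum after normalization
    return w - lr * state["m"], state
```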
arXiv Detail & Related papers (2026-02-10T19:00:25Z)
- Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback [50.89125374999765]
We provide the first convergence guarantee for Optimistic Multiplicative Weights Update ($\mathtt{OMWU}$) in NLHF. Our analysis identifies a novel marginal convergence behavior, where the probability of rarely played actions grows exponentially from exponentially small values.
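For reference, the generic OMWU update on a zero-sum matrix game looks as follows; the NLHF setting studied in the paper replaces the fixed payoff matrix with preference-based feedback, so this is only the base update rule, with illustrative step size and horizon.

```python
import numpy as np

def omwu(A, eta=0.1, steps=500):
    """Optimistic Multiplicative Weights Update on max_x min_y x^T A y."""
    m, n = A.shape
    x, y = np.ones(m) / m, np.ones(n) / n
    gx_prev, gy_prev = A @ y, A.T @ x
    for _ in range(steps):
        gx, gy = A @ y, A.T @ x                   # current payoffs / losses
        x = x * np.exp(eta * (2 * gx - gx_prev))  # optimistic step for the maximizer
        y = y * np.exp(-eta * (2 * gy - gy_prev)) # optimistic step for the minimizer
        x, y = x / x.sum(), y / y.sum()
        gx_prev, gy_prev = gx, gy
    return x, y

# Example: approximate equilibrium of a random 3x3 zero-sum game.
x_star, y_star = omwu(np.random.default_rng(1).normal(size=(3, 3)))
```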
arXiv Detail & Related papers (2025-12-31T12:08:29Z)
- Muon is Provably Faster with Momentum Variance Reduction [55.388203260208485]
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norms outperform Adam-type methods in training large language models.
arXiv Detail & Related papers (2025-12-18T14:38:39Z)
- Offline Preference Optimization via Maximum Marginal Likelihood Estimation [9.001971182501501]
This work recasts alignment through the lens of maximum marginal likelihood (MML) estimation. Our new MML-based Preference Optimization (MMPO) maximizes the marginal log-likelihood of a preferred text output. We show that MMPO achieves competitive or superior preference alignment while better preserving the base model's general language capabilities.
arXiv Detail & Related papers (2025-10-27T00:15:57Z)
- From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models [90.45197506653341]
Large reasoning models (LRMs) generate intermediate reasoning traces before producing final answers. Aligning LRMs with human preferences, a crucial prerequisite for model deployment, remains underexplored. A common workaround optimizes a single sampled trajectory, which introduces substantial gradient variance from trace sampling.
arXiv Detail & Related papers (2025-10-06T17:58:01Z)
- Divergence Minimization Preference Optimization for Diffusion Model Alignment [66.31417479052774]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing reverse KL divergence. DMPO can consistently outperform or match existing techniques across different base models and test sets.
arXiv Detail & Related papers (2025-07-10T07:57:30Z)
- Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness [51.302674884611335]
This work introduces a hybrid non-Euclidean optimization method which generalizes norm clipping by combining steepest descent and conditional gradient approaches. We discuss how to instantiate the algorithms for deep learning and demonstrate their properties on image classification and language modeling.
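As a point of reference, the Euclidean special case being generalized, ordinary gradient-norm clipping, interpolates between a plain gradient step and a fixed-length normalized steepest-descent step; the sketch below uses assumed parameter names and is not the paper's algorithm.

```python
import numpy as np

def clipped_step(w, g, lr=0.1, tau=1.0):
    """Classical gradient-norm clipping: below the threshold tau this is plain
    gradient descent; above it, a normalized step of fixed length lr * tau."""
    scale = min(1.0, tau / (np.linalg.norm(g) + 1e-12))
    return w - lr * scale * g
```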
arXiv Detail & Related papers (2025-06-02T17:34:29Z)
- Multi-Step Consistency Models: Fast Generation with Theoretical Guarantees [15.366598179769918]
We provide a theoretical analysis of consistency models capable of mapping inputs at a given time to arbitrary points along the reverse trajectory. We show that one can achieve a KL divergence of order $O(\varepsilon^2)$ using only $O\left(\log\left(\frac{d}{\varepsilon}\right)\right)$ iterations with a constant step size. We conclude that accurate learning is feasible using small discretization steps, both in smooth and non-smooth settings.
arXiv Detail & Related papers (2025-05-02T06:50:46Z)
- Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data [33.082961718280245]
We provide the first complete characterization of implicit optimization bias for $p$-norm normalized steepest descent (NSD) and momentum steepest descent (NMD). Our results prove that these algorithms converge to solutions maximizing the margin with respect to the matrix's $p$-norm, with established convergence rates.
arXiv Detail & Related papers (2025-02-07T05:09:32Z)
- The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks [117.93273337740442]
We show that gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
We also show that batch normalization has an implicit bias towards a patch-wise uniform margin.
arXiv Detail & Related papers (2023-06-20T16:58:00Z)
- Survey Descent: A Multipoint Generalization of Gradient Descent for Nonsmooth Optimization [0.0]
We propose a generalization of the gradient descent iteration for local optimization.
We prove linear convergence when the objective is itself max-of-smooth, and experiments suggest a more general phenomenon.
arXiv Detail & Related papers (2021-11-30T18:28:17Z)
- A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-$\ell_1$-Norm Interpolated Classifiers [3.167685495996986]
This paper establishes a precise high-dimensional theory for boosting on separable data.
Under a class of statistical models, we provide an exact analysis of the generalization error of boosting.
We also explicitly pin down the relation between the boosting test error and the optimal Bayes error.
arXiv Detail & Related papers (2020-02-05T00:24:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.