Related papers: Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

URL: http://arxiv.org/abs/2602.07425v1
Date: Sat, 07 Feb 2026 07:47:14 GMT
Title: Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise
Authors: Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang,
Abstract summary: Sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW.<n>In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise.
Score: 43.39716211464324
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLM). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well-aligned with practice.

Related papers

Towards a Theoretical Understanding to the Generalization of RLHF [15.278675771756541]
We build the generalization theory on RLHF of LLMs under the linear reward model.<n>We argue that our results provide new theoretical evidence for the empirically observed generalization of LLMs after RLHF.
arXiv Detail & Related papers (2026-01-23T02:30:16Z)
How Well Can Preference Optimization Generalize Under Noisy Feedback? [7.374590753074647]
Preference optimization trains models to distinguish between preferred and non-preferred responses based on human feedback.<n>Most existing works assume noise-free feedback, which is unrealistic due to the inherent errors and inconsistencies in human judgments.<n>This paper addresses the impact of noisy feedback on preference optimization, providing generalization guarantees under these conditions.
arXiv Detail & Related papers (2025-10-01T20:56:31Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance.<n>This paper addresses the question of how to optimally combine the model's predictions and the provided labels.<n>Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
Unmasking Bias in Diffusion Model Training [40.90066994983719]
Denoising diffusion models have emerged as a dominant approach for image generation. They still suffer from slow convergence in training and color shift issues in sampling. In this paper, we identify that these obstacles can be largely attributed to bias and suboptimality inherent in the default training paradigm.
arXiv Detail & Related papers (2023-10-12T16:04:41Z)
Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework. We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us efficiently infer the latent true labels. Our approach safeguards the stable update of the noise transition, which avoids previous arbitrarily tuning from a mini-batch of samples.
arXiv Detail & Related papers (2023-02-19T15:24:37Z)
Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR) Specifically, we propose to inject the standard Gaussian noise and regularize hidden representations of the fine-tuned model. We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
Heavy-tailed denoising score matching [5.371337604556311]
We develop an iterative noise scaling algorithm to consistently initialise the multiple levels of noise in Langevin dynamics. On the practical side, our use of heavy-tailed DSM leads to improved score estimation, controllable sampling convergence, and more balanced unconditional generative performance for imbalanced datasets.
arXiv Detail & Related papers (2021-12-17T22:04:55Z)
Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD. We prove that with constraint to guarantee low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
Beyond variance reduction: Understanding the true impact of baselines on policy optimization [24.09670734037029]
We show that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics.
arXiv Detail & Related papers (2020-08-31T17:52:09Z)
Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
empirical optimization is central to modern machine learning, but its role in its success is still unclear. We show that it commonly arises in parameters of discrete multiplicative noise due to variance. A detailed analysis is conducted in which we describe on key factors, including recent step size, and data, all exhibit similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.