Muon Outperforms Adam in Tail-End Associative Memory Learning
- URL: http://arxiv.org/abs/2509.26030v2
- Date: Sun, 05 Oct 2025 09:26:34 GMT
- Title: Muon Outperforms Adam in Tail-End Associative Memory Learning
- Authors: Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, Vincent Y. F. Tan
- Abstract summary: We show that Muon consistently achieves balanced learning across classes regardless of feature embeddings. Our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories.
- Score: 118.98991042050532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we reveal that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and Feed-Forward Networks (FFNs), are the primary contributors to Muon's superiority. Motivated by this associative memory view, we then explain Muon's superiority on real-world corpora, which are intrinsically heavy-tailed: a few classes (tail classes) appear far less frequently than others. The superiority is explained through two key properties: (i) its update rule consistently yields a more isotropic singular spectrum than Adam; and as a result, (ii) on heavy-tailed data, it optimizes tail classes more effectively than Adam. Beyond empirical evidence, we theoretically confirm these findings by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on embedding properties. In summary, our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.
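The abstract's claim that Muon's update rule "yields a more isotropic singular spectrum than Adam" can be illustrated with a short sketch. Muon replaces the raw momentum matrix with an approximate semi-orthogonalization computed by a quintic Newton-Schulz iteration; the coefficients below follow the commonly cited reference implementation, and the random Gaussian matrix stands in for a momentum buffer. This is a minimal sketch of the orthogonalization step only, not the full optimizer:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately map G's singular values to ~1 (semi-orthogonalization).

    Quintic Newton-Schulz iteration as used in Muon; the coefficients
    are the commonly cited ones from the reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius scaling puts singular values in (0, 1]
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X                 # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 32))             # mock momentum/gradient matrix
O = newton_schulz_orthogonalize(G)

s_before = np.linalg.svd(G, compute_uv=False)
s_after = np.linalg.svd(O, compute_uv=False)
print(s_before.max() / s_before.min())    # large condition number: anisotropic spectrum
print(s_after.max() / s_after.min())      # close to 1: near-isotropic spectrum
```

The connection to the paper's thesis: because the orthogonalized update applies a comparable step size along every singular direction, directions associated with rarely seen (tail) classes are not drowned out by the dominant directions of frequent classes, which is the balanced-learning behavior the analysis attributes to Muon.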
Related papers
- To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters [16.624341041698013]
Muon has perhaps gained the highest popularity due to its superior training speed. This paper investigates the potential downsides stemming from the mechanism driving this speedup. Muon struggles to uncover common underlying structure across tasks, and is more prone to fitting spurious features.
arXiv Detail & Related papers (2026-02-28T17:37:15Z) - A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models [36.99162631444728]
We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block. Our study adopts a structured data model with tokens that include both class-relevant and class-irrelevant patterns under token-level noise. We prove that the model achieves guaranteed generalization by establishing non-asymptotic sample complexity and convergence rate bounds.
arXiv Detail & Related papers (2026-02-13T00:44:26Z) - Learning to Focus: Prioritizing Informative Histories with Structured Attention Mechanisms in Partially Observable Reinforcement Learning [9.233407096706744]
We introduce structured inductive priors into the self-attention mechanism of the dynamics head. Experiments on the Atari 100k benchmark show that most efficiency gains arise from the Gaussian prior.
arXiv Detail & Related papers (2025-11-10T10:53:16Z) - How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data [38.54408542311739]
We show that spectrum-aware matrix generalizations such as Muon and Shampoo might outperform competitive algorithms. We empirically verify our theoretical findings on a variety of imbalanced datasets.
arXiv Detail & Related papers (2025-10-27T04:00:42Z) - NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in a 1.1B-parameter pretraining setting.
arXiv Detail & Related papers (2025-10-07T01:13:41Z) - Conda: Column-Normalized Adam for Training Large Language Models Faster [70.66067959375748]
Column-Normalized Adam (Conda) is a novel approach to training large language models (LLMs). Conda projects updates into a subspace and applies column-wise second-moment normalization based on the projected gradients. Experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training.
arXiv Detail & Related papers (2025-09-29T02:58:19Z) - Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning [51.92313556418432]
Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs). We suggest categorizing tokens within each corpus into two parts -- positive and negative tokens -- based on whether they are useful for improving model performance. We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
arXiv Detail & Related papers (2025-08-06T11:22:23Z) - Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current frontier, achieving better performance with far fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation, which is memory optimal and communication efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z) - AdamL: A fast adaptive gradient method incorporating loss function [1.6025685183216696]
We propose AdamL, a novel variant of Adam that takes loss function information into account to attain better results.
We show that AdamL achieves either the fastest convergence or the lowest objective function values when compared to Adam, EAdam, and AdaBelief.
In the case of vanilla convolutional neural networks, AdamL stands out from the other Adam variants and does not require manual adjustment of the learning rate during the later stage of training.
arXiv Detail & Related papers (2023-12-23T16:32:29Z) - Inducing Neural Collapse in Deep Long-tailed Learning [13.242721780822848]
We propose two explicit feature regularization terms to learn high-quality representation for class-imbalanced data.
With the proposed regularization, Neural Collapse phenomena will appear under the class-imbalanced distribution.
Our method is easily implemented, highly effective, and can be plugged into most existing methods.
arXiv Detail & Related papers (2023-02-24T05:07:05Z) - Improving Tail-Class Representation with Centroid Contrastive Learning [145.73991900239017]
We propose interpolative centroid contrastive learning (ICCL) to improve long-tailed representation learning.
ICCL interpolates two images from a class-agnostic sampler and a class-aware sampler, and trains the model such that the representation of the interpolated image can be used to retrieve the centroids of both source classes.
Our result shows a significant accuracy gain of 2.8% on the iNaturalist 2018 dataset with a real-world long-tailed distribution.
arXiv Detail & Related papers (2021-10-19T15:24:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.