Muon in Associative Memory Learning: Training Dynamics and Scaling Laws
- URL: http://arxiv.org/abs/2602.05725v1
- Date: Thu, 05 Feb 2026 14:49:40 GMT
- Title: Muon in Associative Memory Learning: Training Dynamics and Scaling Laws
- Authors: Binghui Li, Kaifei Wang, Han Zhong, Pinyan Lu, Liwei Wang
- Abstract summary: We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs. We show that Muon mitigates the imbalanced rates at which Gradient Descent learns frequency components, leading to faster and more uniform progress.
- Score: 23.350512542598803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain theoretically unclear. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, leading to slow convergence bottlenecked by low-frequency components. In contrast, the Muon optimizer mitigates this imbalance, leading to faster and more uniform progress. Specifically, in the noiseless case, Muon achieves an exponential speedup over GD; in the noisy case with a power-decay frequency spectrum, we derive Muon's optimization scaling law and demonstrate its superior scaling efficiency over GD. Furthermore, we show that Muon can be interpreted as an implicit matrix preconditioner arising from adaptive task alignment and a block-symmetric gradient structure. By contrast, a preconditioner built from the coordinate-wise sign operator could match Muon only under oracle access to the unknown task representations, which is infeasible for SignGD in practice. Experiments on synthetic long-tail classification and LLaMA-style pre-training corroborate the theory.
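To make the update rule concrete, here is a minimal NumPy sketch of a Muon-style step on a single matrix parameter. The momentum form and the hyperparameters (`lr=0.02`, `beta=0.95`) are illustrative choices, not the paper's; the matrix sign is computed exactly via SVD for clarity, whereas practical Muon implementations approximate it with a Newton-Schulz iteration.

```python
import numpy as np

def matrix_sign(G: np.ndarray) -> np.ndarray:
    """Semi-orthogonal factor of G: if G = U S V^T, return U V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def muon_step(W, G, M, lr=0.02, beta=0.95):
    """One Muon-style update on a matrix parameter W.

    M is the momentum buffer; the update direction is the matrix
    sign of the momentum rather than the raw gradient.
    """
    M = beta * M + G                 # momentum accumulation
    W = W - lr * matrix_sign(M)      # orthogonalized update
    return W, M

# toy usage with a random stand-in for a loss gradient
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
M = np.zeros_like(W)
G = rng.normal(size=(8, 4))
W, M = muon_step(W, G, M)
```

Because `matrix_sign(M)` has all singular values equal to one, every spectral direction moves with the same step size; this uniformity is the mechanism behind the balanced per-frequency progress described in the abstract.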
Related papers
- NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training [50.27276603708547]
We show that despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. We propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further constraining the learned weights toward low-rank structure.
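The summary does not spell out how the nuclear-norm constraint is enforced; one plausible reading, sketched below purely as an assumption, is to soft-threshold the singular values of the momentum before orthogonalization (the helper `numuon_direction` and the threshold `tau` are hypothetical):

```python
import numpy as np

def numuon_direction(M: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Hypothetical nuclear-norm-style shrinkage: soft-threshold the
    singular values of the momentum M, biasing the update toward low
    rank. This is a guess at the mechanism, not the paper's algorithm."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    keep = np.maximum(S - tau, 0.0) > 0       # drop small spectral modes
    return U[:, keep] @ Vt[keep, :]           # low-rank orthogonal direction
```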
arXiv Detail & Related papers (2026-03-04T00:10:14Z) - Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning [10.647088281181222]
SpecMuon is a spectral-aware, multi-mode gradient flow for physics-informed learning. It regulates step sizes according to the global loss energy while preserving Muon's scale-balancing properties. It achieves faster convergence and improved stability compared with Adam and AdamW.
arXiv Detail & Related papers (2026-02-18T03:56:20Z) - Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise [43.39716211464324]
Sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise.
arXiv Detail & Related papers (2026-02-07T07:47:14Z) - Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum [19.385264518362472]
Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks. We propose two variants of Muon, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to momentum. Experiments on GPT-2 and LLaMA pretraining demonstrate that our proposed methods accelerate convergence and consistently achieve lower validation loss than both competitive, well-tuned AdamW and Muon baselines.
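The summary names the variants but not their formulas; the sketch below is a guess at what NSR-modulated, variance-adaptive momentum could look like (the helper `nsr_modulated_momentum` and all constants are hypothetical), with the resulting buffer then fed to Muon's orthogonalization step:

```python
import numpy as np

def nsr_modulated_momentum(G, M, V, beta1=0.95, beta2=0.999, eps=1e-8):
    """Hypothetical variance-adaptive momentum: track an elementwise
    second moment V, estimate a noise-to-signal ratio (NSR), and damp
    the momentum where the gradient signal is unreliable. The exact
    formulas in the paper may differ."""
    M = beta1 * M + (1 - beta1) * G
    V = beta2 * V + (1 - beta2) * G**2
    noise = np.maximum(V - M**2, 0.0)         # variance estimate
    nsr = np.sqrt(noise) / (np.abs(M) + eps)  # noise-to-signal ratio
    M_mod = M / (1.0 + nsr)                   # damp noisy coordinates
    return M_mod, M, V
```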
arXiv Detail & Related papers (2026-01-21T02:41:56Z) - Preconditioning Benefits of Spectral Orthogonalization in Muon [50.62925024212989]
We study the effectiveness of a simplified variant of Muon in two case studies: matrix factorization and in-context learning of linear transformers. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior.
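As a sanity check on the decoupling claim, here is a simplified derivation under the (strong) assumption that gradient and iterate share singular vectors; the paper's actual analysis is more general:

```latex
% Simplified decoupling: assume the gradient shares singular vectors with the
% iterate, G_t = U \,\mathrm{diag}(g_{t,1},\dots,g_{t,r})\, V^\top with
% W_t = U \,\mathrm{diag}(\sigma_{t,1},\dots,\sigma_{t,r})\, V^\top. Then
% \mathrm{msign}(G_t) = U \,\mathrm{diag}(\mathrm{sign}(g_{t,i}))\, V^\top, so
% the step W_{t+1} = W_t - \eta\,\mathrm{msign}(G_t) reduces to independent
% scalar recursions with a common, spectrum-independent step size:
\sigma_{t+1,i} = \sigma_{t,i} - \eta\,\mathrm{sign}(g_{t,i}),
\qquad i = 1,\dots,r.
```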
arXiv Detail & Related papers (2026-01-20T00:08:31Z) - Towards Arbitrary Motion Completing via Hierarchical Continuous Representation [64.6525112550758]
We propose a novel parametric activation-induced hierarchical implicit representation framework, called NAME, based on implicit neural representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns.
arXiv Detail & Related papers (2025-12-24T14:07:04Z) - MuonBP: Faster Muon via Block-Periodic Orthogonalization [24.232069944820513]
We show how to adjust the learning rate when moving from the baseline Muon to MuonBP, and give guarantees for this algorithm. When training an 8B model with eight-way tensor parallelism and ZeRO optimizer-state sharding, MuonBP achieves an 8% speedup over Muon with no degradation in performance.
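Reading "block-periodic" literally, a plausible sketch is to orthogonalize cheap local blocks on most steps and run the full orthogonalization only periodically; the block layout, `period`, and helper names below are assumptions:

```python
import numpy as np

def msign(G):
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def muonbp_direction(M, step, row_blocks=2, period=4):
    """Hypothetical block-periodic orthogonalization: cheap per-block
    msign on most steps, full (communication-heavy) msign only every
    `period` steps. Layout and period are illustrative."""
    if step % period == 0:
        return msign(M)                      # periodic full orthogonalization
    blocks = np.array_split(M, row_blocks, axis=0)
    return np.concatenate([msign(B) for B in blocks], axis=0)
```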
arXiv Detail & Related papers (2025-10-19T19:56:05Z) - NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in a 1.1B-parameter pretraining setting.
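A hedged guess at the neuron-wise normalization: rescale each output row of the orthogonalized update to unit RMS so that no single neuron dominates the step. The helper `normuon_direction` and the exact statistic are assumptions:

```python
import numpy as np

def normuon_direction(M, eps=1e-8):
    """Hypothetical neuron-wise normalization after orthogonalization:
    rescale each output row of the orthogonalized update to unit RMS.
    The paper's exact normalization may differ."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    D = U @ Vt                                          # Muon direction
    row_rms = np.sqrt((D**2).mean(axis=1, keepdims=True))
    return D / (row_rms + eps)                          # per-neuron rescale
```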
arXiv Detail & Related papers (2025-10-07T01:13:41Z) - DeMuon: A Decentralized Muon for Matrix Optimization over Graphs [20.832302616074966]
DeMuon is a method for decentralized matrix optimization over a given communication topology. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity.
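The communication protocol is not given in the summary; the sketch below assumes a standard gossip pattern (row-stochastic mixing matrix `A` over the graph) combined with local orthogonalized updates, which may differ from DeMuon's actual scheme:

```python
import numpy as np

def msign(G):
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def demuon_step(W_list, G_list, M_list, A, lr=0.02, beta=0.95):
    """Hypothetical decentralized step: each node gossip-averages its
    iterate with neighbors via the mixing matrix A, keeps a local
    momentum, and applies an orthogonalized local update."""
    n = len(W_list)
    M_new = [beta * M_list[i] + G_list[i] for i in range(n)]
    W_new = [sum(A[i][j] * W_list[j] for j in range(n))
             - lr * msign(M_new[i]) for i in range(n)]
    return W_new, M_new
```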
arXiv Detail & Related papers (2025-10-01T19:06:11Z) - Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training [3.1922198632169327]
Recently, the Muon optimizer has gained significant attention for its strong performance in foundation model training. We propose low-rank matrix-signed gradient descent and a low-rank variant of Muon.
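A minimal sketch of what a low-rank matrix-signed direction could look like, assuming truncation to the top singular directions; the truncation `rank` and the helper name are illustrative:

```python
import numpy as np

def lowrank_msign(G, rank=8):
    """Hypothetical low-rank orthogonalization: keep only the top-`rank`
    singular directions of G and orthogonalize within that subspace."""
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    r = min(rank, len(S))
    return U[:, :r] @ Vt[:r, :]
```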
arXiv Detail & Related papers (2025-09-15T14:28:53Z) - Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current frontier, achieving better performance with far fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z) - Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework.
We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us to efficiently infer the latent true labels.
Our approach safeguards the stable update of the noise transition, avoiding the arbitrary tuning from a mini-batch of samples required by previous methods.
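For intuition, a single hedged Gibbs step for the latent true label, assuming a class-conditional transition matrix `T` with `T[z, y] = p(y | z)`; the parameterization is an assumption, not LCCN's exact sampler:

```python
import numpy as np

def gibbs_sample_true_label(noisy_label, probs, T, rng):
    """Hypothetical Gibbs step: sample the latent true label z given the
    observed noisy label y, the classifier's predictive distribution
    `probs` over classes, and the noise transition T[z, y] = p(y | z)."""
    posterior = probs * T[:, noisy_label]   # p(z | x) * p(y | z)
    posterior /= posterior.sum()
    return rng.choice(len(probs), p=posterior)
```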
arXiv Detail & Related papers (2023-02-19T15:24:37Z) - Gaussian MRF Covariance Modeling for Efficient Black-Box Adversarial Attacks [86.88061841975482]
We study the problem of generating adversarial examples in a black-box setting, where we only have access to a zeroth-order oracle.
We use this setting to find fast one-step adversarial attacks, akin to a black-box version of the Fast Gradient Sign Method (FGSM).
We show that the method uses fewer queries and achieves higher attack success rates than the current state of the art.
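For reference, a compact sketch of a one-step sign attack driven by a two-point zeroth-order gradient estimate; the isotropic Gaussian sampling below omits the paper's MRF-structured covariance, and all names and constants are illustrative:

```python
import numpy as np

def zo_fgsm(x, loss_fn, eps=0.03, sigma=0.01, n_samples=64, rng=None):
    """One-step sign attack with a zeroth-order gradient estimate,
    loosely in the spirit of a black-box FGSM. Uses only loss-value
    (oracle) queries; no model gradients are accessed."""
    rng = rng or np.random.default_rng(0)
    grad_est = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.normal(size=x.shape)
        # two-point finite-difference estimate along direction u
        delta = loss_fn(x + sigma * u) - loss_fn(x - sigma * u)
        grad_est += delta / (2 * sigma) * u
    grad_est /= n_samples
    return x + eps * np.sign(grad_est)       # FGSM-style perturbation
```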
arXiv Detail & Related papers (2020-10-08T18:36:51Z)