NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training
- URL: http://arxiv.org/abs/2603.03597v1
- Date: Wed, 04 Mar 2026 00:10:14 GMT
- Title: NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training
- Authors: Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P Hewa Koneputugodage, Shamane Siriwardhana, Violetta Shevchenko, Karol Pajak, James Snewin, Gil Avraham, Alexander Long
- Abstract summary: We show that despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. We propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further constraining the learned weights toward low-rank structure.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid progress of large language models (LLMs) is increasingly constrained by memory and deployment costs, motivating compression methods for practical deployment. Many state-of-the-art compression pipelines leverage the low-rank structure of trained weight matrices, a phenomenon often associated with the properties of popular optimizers such as Adam. In this context, Muon is a recently proposed optimizer that improves LLM pretraining via full-rank update steps, but its induced weight-space structure has not yet been characterized. In this work, we report a surprising empirical finding: despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. Motivated by this insight, we propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further constraining the learned weights toward low-rank structure. Across billion-parameter-scale models, we show that NuMuon increases weight compressibility and improves post-compression model quality under state-of-the-art LLM compression pipelines while retaining Muon's favorable convergence behavior.
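The mechanism the abstract describes (Muon's orthogonalized momentum step plus a nuclear-norm constraint on the update direction) can be illustrated in a few lines. The following is a minimal PyTorch sketch, not the authors' implementation: the Newton-Schulz coefficients follow Muon's public reference code, while `nuclear_norm_shrink`, the soft-thresholding step, and the hyperparameters `tau`, `beta`, and `lr` are illustrative assumptions (soft-thresholding is the proximal operator of the nuclear norm, one natural way to realize such a constraint; the paper may enforce it differently).

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximate the orthogonal factor U V^T of G's SVD via a quintic
    # Newton-Schulz iteration; coefficients follow Muon's public
    # reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # scale so all singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def nuclear_norm_shrink(U, tau):
    # Assumed constraint: soft-threshold the singular values of the update,
    # i.e. the proximal operator of tau * ||.||_* -- one natural way to
    # realize a nuclear-norm constraint; the paper may enforce it differently.
    P, S, Qh = torch.linalg.svd(U, full_matrices=False)
    return P @ torch.diag(torch.clamp(S - tau, min=0.0)) @ Qh

@torch.no_grad()
def numuon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95, tau=0.05):
    # One sketched NuMuon update on a 2-D weight matrix: Muon's momentum
    # and orthogonalization, followed by the assumed low-rank-biasing step.
    momentum_buf.mul_(beta).add_(grad)
    update = nuclear_norm_shrink(newton_schulz_orthogonalize(momentum_buf), tau)
    weight.add_(update, alpha=-lr)
```

Under this reading, soft-thresholding zeroes the small singular values of each step, which is what would nudge the accumulated weights toward the low-rank structure that compression pipelines exploit.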
Related papers
- Muon+: Towards Better Muon via One Additional Normalization Step [18.816463168231618]
We propose a simple yet effective enhancement to Muon, namely Muon+. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures.
arXiv Detail & Related papers (2026-02-25T04:04:00Z)
- Muon in Associative Memory Learning: Training Dynamics and Scaling Laws [23.350512542598803]
We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs. We show that Muon mitigates the imbalance this frequency hierarchy induces, leading to faster and more uniform progress.
arXiv Detail & Related papers (2026-02-05T14:49:40Z)
- Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum [19.385264518362472]
Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks. We propose two variants of Muon, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to momentum. Experiments on GPT-2 and LLaMA pretraining demonstrate that our proposed methods accelerate convergence and consistently achieve lower validation loss than both competitive, well-tuned AdamW and Muon baselines.
arXiv Detail & Related papers (2026-01-21T02:41:56Z)
- Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs [80.72350166388601]
Nemotron Elastic is a framework for building reasoning-oriented LLMs. It embeds nested submodels within a single parent model. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment.
arXiv Detail & Related papers (2025-11-20T18:59:21Z)
- NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in a 1.1B pretraining setting.
arXiv Detail & Related papers (2025-10-07T01:13:41Z)
- REG: A Regularization Optimizer for Robust Training Dynamics [24.850151895583494]
The Row-and-Column-Scaling (RACS) operator regularizes the update steps in a less drastic manner, making it simpler to implement and more compatible with established training dynamics. We demonstrate that our REG not only achieves superior performance and stability over AdamW but also maintains consistency with the AdamW training paradigm.
arXiv Detail & Related papers (2025-10-04T06:05:57Z)
- AdaMuon: Adaptive Muon Optimizer [11.281916426508216]
AdaMuon combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon maintains stability while surpassing Adam by more than 40% in training efficiency in large-scale scenarios.
arXiv Detail & Related papers (2025-07-15T05:49:37Z)
- Minitron-SSM: Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning [76.88243649182886]
Hybrid architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities.
arXiv Detail & Related papers (2025-04-15T17:26:29Z)
- Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation that is memory-optimal and communication-efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z)
- Beyond Pretrained Features: Noisy Image Modeling Provides Adversarial Defense [52.66971714830943]
Masked image modeling (MIM) has become a prevailing framework for self-supervised visual representation learning.
In this paper, we investigate how this powerful self-supervised learning paradigm can provide adversarial robustness to downstream classifiers.
We propose an adversarial defense method, referred to as De3, by exploiting the pretrained decoder for denoising.
arXiv Detail & Related papers (2023-02-02T12:37:24Z)
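The abstract's central empirical claim, that Muon-trained weight matrices are readily compressible, can be probed with a simple SVD-based check. Below is a minimal sketch under stated assumptions: `effective_rank` is a common spectral-energy proxy rather than a metric taken from the paper, the truncated-SVD factorization stands in for the "standard pipelines" the abstract mentions, and all names and the 0.99 energy threshold are illustrative.

```python
import torch

def effective_rank(W, energy=0.99):
    # Smallest k whose top-k singular values capture `energy` of the
    # squared spectral mass -- a common proxy for low-rank structure.
    S = torch.linalg.svdvals(W)
    cum = torch.cumsum(S**2, dim=0) / (S**2).sum()
    return int((cum < energy).sum().item()) + 1

def truncated_svd_compress(W, rank):
    # Rank-`rank` factorization W ~= A @ B, the building block of many
    # low-rank LLM compression pipelines.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vh[:rank, :]

# A weight whose effective rank is far below min(m, n) is "readily
# compressible" in the abstract's sense (synthetic rank-64 example).
W = torch.randn(1024, 64) @ torch.randn(64, 1024)
r = effective_rank(W)
A, B = truncated_svd_compress(W, r)
print(r, torch.dist(W, A @ B).item())
```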