How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data
- URL: http://arxiv.org/abs/2510.22980v1
- Date: Mon, 27 Oct 2025 04:00:42 GMT
- Title: How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data
- Authors: Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, Christos Thrampoulidis
- Abstract summary: We study when spectrum-aware matrix-valued optimizers such as Muon and Shampoo outperform competitive algorithms. We empirically verify our theoretical findings on a variety of imbalanced datasets.
- Score: 38.54408542311739
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing adoption of spectrum-aware matrix-valued optimizers such as Muon and Shampoo in deep learning motivates a systematic study of their generalization properties and, in particular, when they might outperform competitive algorithms. We approach this question by introducing appropriate simplifying abstractions as follows: First, we use imbalanced data as a testbed. Second, we study the canonical form of such optimizers, which is Spectral Gradient Descent (SpecGD) -- each update step is $UV^T$ where $U\Sigma V^T$ is the truncated SVD of the gradient. Third, within this framework we identify a canonical setting for which we precisely quantify when SpecGD outperforms vanilla Euclidean GD. For a Gaussian mixture data model and both linear and bilinear models, we show that unlike GD, which prioritizes learning dominant principal components of the data first, SpecGD learns all principal components of the data at equal rates. We demonstrate how this translates to a growing gap in balanced accuracy favoring SpecGD early in training and further show that the gap remains consistent even when the GD counterpart uses adaptive step-sizes via normalization. By extending the analysis to deep linear models, we show that depth amplifies these effects. We empirically verify our theoretical findings on a variety of imbalanced datasets. Our experiments compare practical variants of spectral methods, like Muon and Shampoo, against their Euclidean counterparts and Adam. The results validate our findings that these spectral optimizers achieve superior generalization by promoting a more balanced learning of the data's underlying components.
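A minimal sketch of the SpecGD update described above, assuming a toy least-squares setup purely for illustration: writing the gradient as $G = U\Sigma V^T$, Euclidean GD moves along each singular direction in proportion to its singular value, whereas SpecGD steps with $UV^T$ so every retained direction moves at the same rate, which is why all components are learned at equal speed. The truncation rank `k`, step size, and toy data below are assumptions, not details from the paper.
```python
# Sketch of a Spectral GD (SpecGD) step vs. a vanilla Euclidean GD step.
import numpy as np

def specgd_step(W, G, lr, k=None):
    """One SpecGD update: W <- W - lr * U_k V_k^T, where U S V^T is the SVD of G."""
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    if k is not None:                  # optional truncation to the top-k singular directions
        U, Vt = U[:, :k], Vt[:k, :]
    return W - lr * (U @ Vt)           # every retained direction moves at the same rate

def gd_step(W, G, lr):
    """Vanilla Euclidean GD: dominant singular directions of G move fastest."""
    return W - lr * G

# Toy usage on a random linear-regression gradient (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
X, Y = rng.normal(size=(32, 8)), rng.normal(size=(32, 4))
G = X.T @ (X @ W - Y) / len(X)         # gradient of 0.5 * ||XW - Y||^2 / n
W_spec = specgd_step(W, G, lr=0.1, k=3)
W_gd = gd_step(W, G, lr=0.1)
```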
Related papers
- InfoNCE Induces Gaussian Distribution [7.8922077372145685]
InfoNCE and its variants are among the most widely used losses in contrastive training. We show that the InfoNCE objective induces Gaussian structure in the representations that emerge from contrastive training. The resulting Gaussian model enables a principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
arXiv Detail & Related papers (2026-02-27T13:35:58Z)
- Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Bias in Motion: Theoretical Insights into the Dynamics of Bias in SGD Training [7.5041863920639456]
Machine learning systems often acquire biases by leveraging undesired features in the data, impacting accuracy across different sub-populations. This paper explores the evolution of bias in a teacher-student setup modeling different data sub-populations with a Gaussian-mixture model. Applying our findings to fairness and robustness, we delineate how and when heterogeneous data and spurious features can generate and amplify bias.
arXiv Detail & Related papers (2024-05-28T15:50:10Z)
- Hodge-Aware Contrastive Learning [101.56637264703058]
Simplicial complexes prove effective in modeling data with multiway dependencies.
We develop a contrastive self-supervised learning approach for processing simplicial data.
arXiv Detail & Related papers (2023-09-14T00:40:07Z)
- Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Spectral Evolution and Invariance in Linear-width Neural Networks [8.419660614226816]
We investigate the spectral properties of linear-width feed-forward neural networks.
We show that the spectra of the weight matrices in this high-dimensional regime are invariant when trained by gradient descent with small constant learning rates.
We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both the weight and kernel matrices exhibit heavy-tail behavior.
arXiv Detail & Related papers (2022-11-11T23:00:30Z)