Dispelling the Curse of Singularities in Neural Network Optimizations
- URL: http://arxiv.org/abs/2602.01308v1
- Date: Sun, 01 Feb 2026 16:09:06 GMT
- Title: Dispelling the Curse of Singularities in Neural Network Optimizations
- Authors: Hengjie Cao, Mengyi Chen, Yifeng Yang, Fang Dong, Ruijun Huang, Anrui Chen, Jixian Zhou, Mingzhi Dong, Yujiang Wang, Dongsheng Li, Wenyi Fang, Yuanyi Lin, Fan Wu, Li Shang
- Abstract summary: We show that the gradient Frobenius norms are bounded by the top singular values of the weight matrices, and as training progresses, the mutually reinforcing growth of weight and representation singularities relaxes these bounds, escalating the risk of sharp loss explosions. To counter this, we propose Parametric Singularity Smoothing (PSS), a lightweight, flexible, and effective method for smoothing the singular spectra of weight matrices.
- Score: 22.05217959662069
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work investigates the optimization instability of deep neural networks from a less-explored yet insightful perspective: the emergence and amplification of singularities in the parametric space. Our analysis reveals that parametric singularities inevitably grow with gradient updates and further intensify alignment with representations, leading to increased singularities in the representation space. We show that the gradient Frobenius norms are bounded by the top singular values of the weight matrices, and as training progresses, the mutually reinforcing growth of weight and representation singularities, termed the curse of singularities, relaxes these bounds, escalating the risk of sharp loss explosions. To counter this, we propose Parametric Singularity Smoothing (PSS), a lightweight, flexible, and effective method for smoothing the singular spectra of weight matrices. Extensive experiments across diverse datasets, architectures, and optimizers demonstrate that PSS mitigates instability, restores trainability even after failure, and improves both training efficiency and generalization.
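As a rough illustration of the idea, the following sketch smooths a weight matrix's singular spectrum by interpolating the singular values toward their mean. This is a hypothetical reconstruction in NumPy, not the authors' PSS implementation; the smoothing rule and the `alpha` knob are assumptions.

```python
import numpy as np

def smooth_singular_spectrum(W: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Shrink the singular values of W toward their mean.

    A hypothetical sketch of spectrum smoothing; the paper's actual PSS
    parameterization may differ. alpha=0 leaves W unchanged, alpha=1
    flattens the spectrum completely.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_smooth = (1.0 - alpha) * s + alpha * s.mean()  # interpolate toward a uniform spectrum
    return U @ np.diag(s_smooth) @ Vt

# Example: the top singular value shrinks while the matrix shape is kept.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
W_s = smooth_singular_spectrum(W, alpha=0.2)
print(np.linalg.svd(W, compute_uv=False)[0], np.linalg.svd(W_s, compute_uv=False)[0])
```

Applied periodically during training, such a projection would keep the top singular value, and hence the gradient Frobenius-norm bound discussed in the abstract, from growing unchecked.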
Related papers
- Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning [10.647088281181222]
SpecMuon is a spectral-aware, multi-mode gradient flow for physics-informed learning. It regulates step sizes according to the global loss energy while preserving Muon's scale-balancing properties. It achieves faster convergence and improved stability compared with Adam and AdamW.
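The summary gives no formula for the step-size rule, so the following one-liner is only a guess at its flavor: damp the learning rate when the global loss energy is large. Both the functional form and `energy_scale` are assumptions, not SpecMuon's actual rule.

```python
def energy_scaled_lr(base_lr: float, loss: float, energy_scale: float = 1.0) -> float:
    """Hypothetical step-size rule: damp the learning rate while the
    global loss energy is large, recover it as the loss decays."""
    return base_lr / (1.0 + loss / energy_scale)

print(energy_scaled_lr(1e-3, loss=10.0))  # heavily damped early in training
print(energy_scaled_lr(1e-3, loss=0.01))  # near base_lr as the loss vanishes
```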
arXiv Detail & Related papers (2026-02-18T03:56:20Z) - SIGMA: Scalable Spectral Insights for LLM Collapse [51.863164847253366]
We introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework for model collapse. By deriving deterministic bounds on the Gram matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. We demonstrate that SIGMA effectively captures the transition towards collapsed states, offering theoretical insight into the mechanics of collapse.
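One concrete way to track the contraction SIGMA bounds is to monitor the spectrum of the Gram matrix of features. The sketch below uses the standard effective-rank (spectral-entropy) metric as a stand-in; SIGMA's actual metric, derived from its spectral inequalities, may differ.

```python
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """Effective rank (entropy of normalized Gram eigenvalues).
    Values near 1 indicate a collapsed representation space."""
    G = features @ features.T / features.shape[1]   # Gram matrix of n samples
    eig = np.clip(np.linalg.eigvalsh(G), 1e-12, None)
    p = eig / eig.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
healthy = rng.standard_normal((64, 512))                      # spread-out features
collapsed = np.outer(rng.standard_normal(64), np.ones(512))   # rank-1 features
print(effective_rank(healthy), effective_rank(collapsed))     # large vs ~1.0
```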
arXiv Detail & Related papers (2026-01-06T19:47:11Z) - Random matrix theory of sparse neuronal networks with heterogeneous timescales [0.6181093777643575]
We study recurrent neuronal networks consisting of excitatory (E) and inhibitory (I) units with additive noise, trained for working memory computation. Here, we investigate the dynamics near the trained equilibria and show that the Jacobians are sparse, non-Hermitian rectangular-block matrices modified by heterogeneous synaptic decay timescales and activation-function gains. An analytic description of the spectral edge is obtained, relating statistical parameters of the Jacobians to near-critical features of the equilibria essential for robust working memory computation.
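To make the object of study concrete, the sketch below assembles a sparse random connectivity matrix with excitatory/inhibitory column signs and heterogeneous timescales and gains, then reads off the spectral edge of the linearized dynamics. All sizes, ratios, and distributions are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
n_e, n_i, p = 400, 100, 0.1          # illustrative sizes and sparsity
n = n_e + n_i

# Sparse synaptic weights: excitatory columns positive, inhibitory negative.
mask = rng.random((n, n)) < p
J = mask * np.abs(rng.standard_normal((n, n))) / np.sqrt(p * n)
J[:, n_e:] *= -4.0                   # inhibition balances excitation (4:1 strength ratio)

# Linearized dynamics near an equilibrium: dx/dt = (G J - I) x / tau,
# with heterogeneous timescales tau and gains G (assumed forms).
tau = rng.uniform(0.5, 2.0, size=n)
gains = rng.uniform(0.5, 1.5, size=n)
A = (np.diag(gains) @ J - np.eye(n)) / tau[:, None]

eigs = np.linalg.eigvals(A)
print("spectral edge (max real part):", eigs.real.max())  # < 0 means locally stable
```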
arXiv Detail & Related papers (2025-12-14T17:02:22Z) - Low-Rank Tensor Recovery via Variational Schatten-p Quasi-Norm and Jacobian Regularization [49.85875869048434]
We propose a CP-based low-rank tensor function parameterized by neural networks for implicit neural representation. To achieve sparser CP decomposition, we introduce a variational Schatten-p quasi-norm to prune redundant rank-1 components. For smoothness, we propose a regularization term based on the spectral norm of the Jacobian and Hutchinson's trace estimator.
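Hutchinson's trace estimator, which the summary invokes for the Jacobian regularizer, needs only matrix-vector products: tr(A) ≈ E[zᵀAz] for Rademacher probes z. The sketch below shows the estimator on a plain matrix standing in for the Jacobian operator.

```python
import numpy as np

def hutchinson_trace(matvec, dim: int, n_samples: int = 100, seed: int = 0) -> float:
    """Estimate tr(A) as the mean of z^T A z over Rademacher probes z.
    Only matrix-vector products are needed, so the same idea applies to
    Jacobian operators accessed through JVPs."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ matvec(z)
    return total / n_samples

rng = np.random.default_rng(1)
B = rng.standard_normal((10, 10))
A = B @ B.T
print(np.trace(A), hutchinson_trace(lambda v: A @ v, dim=10))  # true vs estimate
```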
arXiv Detail & Related papers (2025-06-27T11:23:10Z) - Consistency of augmentation graph and network approximability in contrastive learning [3.053989095162017]
We analyze pointwise and spectral consistency of the augmentation graph Laplacian. We show that the Laplacian converges to a weighted Laplace-Beltrami operator on the natural data manifold. These consistency results ensure that the graph Laplacian spectrum effectively captures the manifold geometry.
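For a concrete picture, the sketch below builds a Gaussian-affinity graph over points sampled from a circle (a stand-in for the natural data manifold and the augmentation graph) and computes the low end of the normalized Laplacian's spectrum; the bandwidth and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Points on a circle stand in for a data manifold.
theta = rng.uniform(0, 2 * np.pi, size=200)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Gaussian affinity graph (bandwidth eps is an illustrative choice).
eps = 0.1
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / eps)

# Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))

eigs = np.sort(np.linalg.eigvalsh(L))
print(eigs[:5])  # small eigenvalues approximate the Laplace-Beltrami spectrum
```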
arXiv Detail & Related papers (2025-02-06T18:55:51Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
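For reference, the unhinged loss is usually defined as ℓ(y, f(x)) = 1 − y·f(x) for labels y ∈ {−1, +1}, following van Rooyen et al.; we assume this paper uses the same definition. Because the loss is linear in f, its gradient does not depend on the current margin, which is what makes closed-form dynamics tractable:

```python
import numpy as np

def unhinged_loss(y: np.ndarray, f: np.ndarray) -> np.ndarray:
    """Unhinged loss 1 - y*f for labels in {-1, +1}. Linear in f, so the
    gradient with respect to f is just -y, independent of the margin."""
    return 1.0 - y * f

y = np.array([1.0, -1.0, 1.0])
f = np.array([2.0, 0.5, -1.0])
print(unhinged_loss(y, f))  # [-1.   1.5  2. ]
```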
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Regularization, early-stopping and dreaming: a Hopfield-like setup to
address generalization and overfitting [0.0]
We look for optimal network parameters by applying gradient descent over a regularized loss function.
Within this framework, the optimal neuron-interaction matrices correspond to Hebbian kernels revised by a reiterated unlearning protocol.
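A minimal sketch of the setup, under the classical "unlearning" (dreaming) protocol of Hopfield et al.: build the Hebbian kernel from stored patterns, then repeatedly relax from random states and weaken whichever attractor is reached. The step size and dream count below are illustrative, and this is the textbook protocol rather than the paper's exact revision.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 10
xi = rng.choice([-1.0, 1.0], size=(P, N))   # stored patterns
J = xi.T @ xi / N                            # Hebbian interaction kernel
np.fill_diagonal(J, 0.0)

def converge(J, s, steps=50):
    """Relax to a fixed point of sign(J s) (synchronous updates, capped)."""
    for _ in range(steps):
        s_new = np.sign(J @ s)
        s_new[s_new == 0] = 1.0
        if np.array_equal(s_new, s):
            break
        s = s_new
    return s

# Reiterated unlearning: weaken whatever attractor a random state falls into
# (spurious mixtures are hit disproportionately often).
eps = 0.01
for _ in range(200):
    s = converge(J, rng.choice([-1.0, 1.0], size=N))
    J -= eps * np.outer(s, s) / N
    np.fill_diagonal(J, 0.0)
```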
arXiv Detail & Related papers (2023-08-01T15:04:30Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Demystifying the Global Convergence Puzzle of Learning
Over-parameterized ReLU Nets in Very High Dimensions [1.3401746329218014]
This paper is devoted to rigorous theory for demystifying the global convergence phenomenon in a challenging scenario: learning over-parameterized ReLU nets on very high-dimensional data.
A major ingredient of our theory is ...
arXiv Detail & Related papers (2022-06-05T02:14:21Z) - Spectral Tensor Train Parameterization of Deep Learning Layers [136.4761580842396]
We study low-rank parameterizations of weight matrices with embedded spectral properties in the Deep Learning context.
We show the effects of neural network compression in the classification setting, and of both compression and improved training stability in the generative adversarial training setting.
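A basic primitive for embedding spectral properties into a layer is a cheap power-iteration estimate of the top singular value, as in spectral normalization. The sketch below shows that primitive on a dense matrix; how it is threaded through tensor-train cores in the paper is not reconstructed here.

```python
import numpy as np

def spectral_norm(W: np.ndarray, n_iter: int = 50, seed: int = 0) -> float:
    """Estimate the top singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

W = np.random.default_rng(1).standard_normal((128, 64))
print(spectral_norm(W), np.linalg.svd(W, compute_uv=False)[0])  # should agree

# Dividing W by this value constrains the layer's Lipschitz constant, the
# kind of spectral property such parameterizations embed directly.
W_normalized = W / spectral_norm(W)
```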
arXiv Detail & Related papers (2021-03-07T00:15:44Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of the stochasticity in its success is still unclear.
We show that heavy-tailed behavior commonly arises in the parameters of stochastic optimization as a consequence of multiplicative noise.
A detailed analysis is conducted in which we describe how key factors, including step size and data, shape this behavior, with similar results observed on state-of-the-art neural network models.
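The core mechanism is easy to reproduce in isolation: a scalar Kesten-type recursion x ← a·x + b with random multiplicative factor a develops heavy tails even though a and b are Gaussian. The toy simulation below illustrates this claim; all constants are arbitrary and unrelated to the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_chains = 2000, 5000

# Kesten-type recursion: multiplicative noise a, additive noise b.
x = np.zeros(n_chains)
for _ in range(T):
    a = 1.0 + 0.5 * rng.standard_normal(n_chains)   # multiplicative factor
    b = rng.standard_normal(n_chains)               # additive term
    x = np.clip(a * x + b, -1e12, 1e12)             # clip only to avoid overflow

# Excess kurtosis far above 0 (the Gaussian value) signals heavy tails.
z = (x - x.mean()) / x.std()
print("excess kurtosis:", (z**4).mean() - 3.0)
```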
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.