Optimizer choice matters for the emergence of Neural Collapse
- URL: http://arxiv.org/abs/2602.16642v3
- Date: Wed, 25 Feb 2026 14:03:15 GMT
- Title: Optimizer choice matters for the emergence of Neural Collapse
- Authors: Jim Zhao, Tin Sum Cheng, Wojciech Masarczyk, Aurelien Lucchi,
- Abstract summary: Neural Collapse (NC) refers to the emergence of highly symmetric geometric structures in the representations of deep neural networks during the terminal phase of training. Existing analyses largely ignore the role of the optimizer, thereby suggesting that NC is universal across optimization methods. In this work, we demonstrate that the choice of optimizer plays a critical role in the emergence of NC.
- Score: 4.951149983257743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural Collapse (NC) refers to the emergence of highly symmetric geometric structures in the representations of deep neural networks during the terminal phase of training. Despite its prevalence, the theoretical understanding of NC remains limited. Existing analyses largely ignore the role of the optimizer, thereby suggesting that NC is universal across optimization methods. In this work, we challenge this assumption and demonstrate that the choice of optimizer plays a critical role in the emergence of NC. The phenomenon is typically quantified through NC metrics, which, however, are difficult to track and analyze theoretically. To overcome this limitation, we introduce a novel diagnostic metric, NC0, whose convergence to zero is a necessary condition for NC. Using NC0, we provide theoretical evidence that NC cannot emerge under decoupled weight decay in adaptive optimizers, as implemented in AdamW. Concretely, we prove that SGD, SignGD with coupled weight decay (a special case of Adam), and SignGD with decoupled weight decay (a special case of AdamW) exhibit qualitatively different NC0 dynamics. Also, we show the accelerating effect of momentum on NC (beyond convergence of train loss) when trained with SGD, being the first result concerning momentum in the context of NC. Finally, we conduct extensive empirical experiments consisting of 3,900 training runs across various datasets, architectures, optimizers, and hyperparameters, confirming our theoretical results. This work provides the first theoretical explanation for optimizer-dependent emergence of NC and highlights the overlooked role of weight-decay coupling in shaping the implicit biases of optimizers.
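As a rough illustration of the quantities discussed in the abstract, the sketch below computes a standard NC1-style within-class variability metric (the paper's NC0 metric is not specified above, so it is not reproduced here) and contrasts SignGD updates with coupled versus decoupled weight decay, the two regimes the abstract compares. All names are illustrative; this is a minimal sketch under stated assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): a standard NC1-style
# within-class variability metric, and SignGD updates with coupled vs.
# decoupled weight decay, the two regimes contrasted in the abstract.
import numpy as np


def nc1_within_class_variability(features, labels):
    """Ratio of within-class to between-class scatter of last-layer features.
    Smaller values indicate stronger variability collapse (one common NC1 variant)."""
    global_mean = features.mean(axis=0)
    dim = features.shape[1]
    sw = np.zeros((dim, dim))   # within-class scatter
    sb = np.zeros((dim, dim))   # between-class scatter
    n = len(features)
    for c in np.unique(labels):
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        centered = fc - mu_c
        sw += centered.T @ centered / n
        diff = (mu_c - global_mean)[:, None]
        sb += (len(fc) / n) * (diff @ diff.T)
    return np.trace(sw) / max(np.trace(sb), 1e-12)


def signgd_step_coupled(w, grad, lr=1e-2, wd=5e-4):
    """Coupled weight decay: the decay term enters *before* the sign, so it is
    flattened by the sign nonlinearity (SignGD analogue of Adam-style coupling)."""
    return w - lr * np.sign(grad + wd * w)


def signgd_step_decoupled(w, grad, lr=1e-2, wd=5e-4):
    """Decoupled weight decay: the decay term is applied *outside* the sign and
    shrinks the weights directly at every step (SignGD analogue of AdamW)."""
    return w - lr * (np.sign(grad) + wd * w)
```

The contrast in the last two functions is the one the abstract points to: with coupling, the decay contribution is absorbed into the per-coordinate ±1 sign, whereas with decoupling it shrinks the weights multiplicatively, which is why the two variants can exhibit qualitatively different NC0 dynamics.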
Related papers
- Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data [52.737775129027575]
We show that gradient flow on a two-layer ReLU network for classifying orthogonally separable data provably exhibits Neural Collapse (NC). We reveal the role of the implicit bias of the training dynamics in facilitating the emergence of NC.
arXiv Detail & Related papers (2025-10-24T01:36:19Z) - Beyond Unconstrained Features: Neural Collapse for Shallow Neural Networks with General Data [0.8594140167290099]
Neural collapse (NC) is a phenomenon that emerges in the terminal phase of training of deep neural networks (DNNs).
We provide a complete characterization of when NC occurs for two- or three-layer neural networks.
arXiv Detail & Related papers (2024-09-03T12:30:21Z) - Can Kernel Methods Explain How the Data Affects Neural Collapse? [9.975341265604577]
"Neural Collapse" (NC) phenomenon emerges when training neural network (NN) classifiers beyond the zero training error point.<n>This paper explores the possibility of analyzing NC1 using kernels associated with shallow NNs.
arXiv Detail & Related papers (2024-06-04T08:33:56Z) - Towards Demystifying the Generalization Behaviors When Neural Collapse Emerges [132.62934175555145]
Neural Collapse (NC) is a well-known phenomenon of deep neural networks in the terminal phase of training (TPT).
We propose a theoretical explanation for why continued training can still improve accuracy on the test set even after the training accuracy has reached 100%.
We refer to this newly discovered property as "non-conservative generalization".
arXiv Detail & Related papers (2023-10-12T14:29:02Z) - Towards Understanding Neural Collapse: The Effects of Batch Normalization and Weight Decay [0.6813925418351435]
Neural Collapse (NC) is a geometric structure recently observed at the terminal phase of training deep neural networks.
We demonstrate that batch normalization (BN) and weight decay (WD) critically influence the emergence of NC.
Our experiments substantiate the theoretical insights, showing that models exhibit stronger NC with BN, appropriate WD values, lower loss, and a lower last-layer feature norm.
arXiv Detail & Related papers (2023-09-09T00:05:45Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advances in thermodynamics allow bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectrum and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - A Neural Collapse Perspective on Feature Evolution in Graph Neural Networks [44.31777384413466]
Graph neural networks (GNNs) have become increasingly popular for classification tasks on graph-structured data.
In this paper, we focus on node-wise classification and explore the feature evolution through the lens of the "Neural Collapse" phenomenon.
We show that even an "optimistic" mathematical model requires that the graphs obey a strict structural condition in order to possess a minimizer with exact collapse.
arXiv Detail & Related papers (2023-07-04T23:03:21Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Deep Neural Collapse Is Provably Optimal for the Deep Unconstrained Features Model [21.79259092920587]
We show that in a deep unconstrained features model, the unique global optimum for binary classification exhibits all the properties typical of deep neural collapse (DNC).
We also empirically show that (i) by optimizing deep unconstrained features models via gradient descent, the resulting solution agrees well with our theory, and (ii) trained networks recover the unconstrained features suitable for DNC.
arXiv Detail & Related papers (2023-05-22T15:51:28Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - Extended Unconstrained Features Model for Exploring Deep Neural Collapse [59.59039125375527]
Recently, a phenomenon termed "neural collapse" (NC) has been empirically observed in deep neural networks.
Recent papers have shown that minimizers with this structure emerge when optimizing a simplified "unconstrained features model" (UFM).
In this paper, we study the UFM for the regularized MSE loss, and show that the minimizers' features can be more structured than in the cross-entropy case.
arXiv Detail & Related papers (2022-02-16T14:17:37Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool (a minimal sketch of such a curvature diagnostic follows this list).
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
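The last entry above tracks the norm of the weight Hessian as a training diagnostic. Below is a minimal sketch of one standard way to approximate that quantity, power iteration on Hessian-vector products via double backpropagation; it is an assumed illustration, not the referenced paper's estimator, and the function name is hypothetical.

```python
# Minimal sketch (assumed approach, not the referenced paper's code): estimate
# the spectral norm of the loss Hessian with power iteration on Hessian-vector
# products, a common way to track curvature during training.
import torch


def hessian_spectral_norm(loss, params, iters=20):
    """Approximate the magnitude of the largest Hessian eigenvalue of `loss`
    w.r.t. `params` (a list of tensors covered by the graph of `loss`)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit start vector, stored blockwise (one tensor per parameter).
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]
    eig = torch.tensor(0.0)
    for _ in range(iters):
        # Hessian-vector product: differentiate <grad, v> with respect to params.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        eig = torch.sqrt(sum((h * h).sum() for h in hv))  # ||Hv|| for unit v
        v = [h / (eig + 1e-12) for h in hv]
    return eig.item()
```

At a logging step one might call it as `hessian_spectral_norm(criterion(model(x), y), list(model.parameters()))`; running the same routine per layer would give the layer-wise curvature view the summary describes.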