Towards Demystifying the Generalization Behaviors When Neural Collapse
Emerges
- URL: http://arxiv.org/abs/2310.08358v1
- Date: Thu, 12 Oct 2023 14:29:02 GMT
- Title: Towards Demystifying the Generalization Behaviors When Neural Collapse
Emerges
- Authors: Peifeng Gao, Qianqian Xu, Yibo Yang, Peisong Wen, Huiyang Shao,
Zhiyong Yang, Bernard Ghanem, Qingming Huang
- Abstract summary: Neural Collapse (NC) is a well-known phenomenon of deep neural networks in the terminal phase of training (TPT).
We propose a theoretical explanation for why continued training can still lead to accuracy improvement on the test set, even after the training accuracy has reached 100%.
We refer to this newly discovered property as "non-conservative generalization".
- Score: 132.62934175555145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural Collapse (NC) is a well-known phenomenon of deep neural networks in
the terminal phase of training (TPT). It is characterized by the collapse of
features and the classifier into a symmetric structure known as a simplex
equiangular tight frame (ETF). While there have been extensive studies on
optimization characteristics showing the global optimality of neural collapse,
little research has been done on the generalization behaviors during the
occurrence of NC. In particular, the important phenomenon of generalization
improvement during TPT has remained an empirical observation lacking a
rigorous theoretical explanation. In this paper, we establish the connection
between the minimization of the cross-entropy (CE) loss and a multi-class SVM
during TPT, and then derive a multi-class margin generalization bound, which
provides a theoretical explanation for why continued training can still lead
to accuracy improvement on the test set, even after the training accuracy has
reached 100%. Additionally, our further theoretical results indicate that
different alignments between labels and features in a simplex ETF can result
in varying degrees of generalization improvement, despite all models reaching
NC and exhibiting similar optimization performance on the training set. We
refer to this newly discovered property as "non-conservative generalization".
In experiments, we also provide empirical observations to verify the
indications suggested by our theoretical results.
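To make the two objects named in the abstract concrete, here is a minimal numpy sketch (ours, not from the paper) that constructs a K-class simplex ETF and evaluates the multi-class margin that such margin bounds are stated in terms of; the helper names simplex_etf and multiclass_margin are illustrative assumptions, not the authors' code.

import numpy as np

def simplex_etf(num_classes: int) -> np.ndarray:
    """Columns form a simplex equiangular tight frame (ETF) in R^K."""
    K = num_classes
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

def multiclass_margin(W: np.ndarray, h: np.ndarray, y: int) -> float:
    """Score of the true class minus the best competing class score."""
    scores = W.T @ h
    rival = np.delete(scores, y).max()
    return float(scores[y] - rival)

K = 4
M = simplex_etf(K)
print(np.linalg.norm(M, axis=0))   # every class vector has unit norm
print(np.round(M.T @ M, 3))        # off-diagonal cosines are all -1/(K-1)

# At neural collapse, a class-y feature aligns with column y of the ETF,
# which maximizes its multi-class margin at 1 - (-1/(K-1)) = K/(K-1).
h = M[:, 2]
print(multiclass_margin(M, h, 2))  # K/(K-1) ~ 1.333 for K = 4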
Related papers
- Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z) - On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization.
This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase.
Our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z) - A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by, e.g., the combination of model and parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z) - A Neural Collapse Perspective on Feature Evolution in Graph Neural
Networks [44.31777384413466]
Graph neural networks (GNNs) have become increasingly popular for classification tasks on graph-structured data.
In this paper, we focus on node-wise classification and explore the feature evolution through the lens of the "Neural Collapse" phenomenon.
We show that even an "optimistic" mathematical model requires that the graphs obey a strict structural condition in order to possess a minimizer with exact collapse.
arXiv Detail & Related papers (2023-07-04T23:03:21Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Deep Neural Collapse Is Provably Optimal for the Deep Unconstrained
Features Model [21.79259092920587]
We show that in a deep unconstrained features model, the unique global optimum for binary classification exhibits all the properties typical of deep neural collapse (DNC).
We also empirically show that (i) by optimizing deep unconstrained features models via gradient descent, the resulting solution agrees well with our theory, and (ii) trained networks recover the unconstrained features suitable for DNC.
arXiv Detail & Related papers (2023-05-22T15:51:28Z) - On Provable Benefits of Depth in Training Graph Convolutional Networks [13.713485304798368]
Graph Convolutional Networks (GCNs) are known to suffer from performance degradation as the number of layers increases.
We argue that there exists a discrepancy between the theoretical understanding of over-smoothing and the practical capabilities of GCNs.
arXiv Detail & Related papers (2021-10-28T14:50:47Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z)