Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers
- URL: http://arxiv.org/abs/2505.15239v1
- Date: Wed, 21 May 2025 08:16:03 GMT
- Title: Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers
- Authors: Peter Súkeník, Christoph H. Lampert, Marco Mondelli
- Abstract summary: We prove that global optima of deep regularized transformers and residual networks (ResNets) with LayerNorm trained with cross entropy or mean squared error loss are approximately collapsed. Our theoretical results are supported by experiments on computer vision and language datasets showing that, as the depth grows, neural collapse indeed becomes more prominent.
- Score: 33.441694038617506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The empirical emergence of neural collapse -- a surprising symmetry in the feature representations of the training data in the penultimate layer of deep neural networks -- has spurred a line of theoretical research aimed at its understanding. However, existing work focuses on data-agnostic models or, when data structure is taken into account, it remains limited to multi-layer perceptrons. Our paper fills both these gaps by analyzing modern architectures in a data-aware regime: we prove that global optima of deep regularized transformers and residual networks (ResNets) with LayerNorm trained with cross entropy or mean squared error loss are approximately collapsed, and the approximation gets tighter as the depth grows. More generally, we formally reduce any end-to-end large-depth ResNet or transformer training into an equivalent unconstrained features model, thus justifying its wide use in the literature even beyond data-agnostic settings. Our theoretical results are supported by experiments on computer vision and language datasets showing that, as the depth grows, neural collapse indeed becomes more prominent.
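For readers who want to check collapse empirically, the sketch below computes the standard NC1 statistic (within-class variability measured against between-class variability) from penultimate-layer features. This is the usual metric from the neural-collapse literature, not code released with this paper; function names are illustrative.

```python
import numpy as np

def nc1_metric(features: np.ndarray, labels: np.ndarray) -> float:
    """Within-class variability collapse (NC1): tr(Sigma_W @ pinv(Sigma_B)) / K.

    features: (N, d) penultimate-layer activations
    labels:   (N,) integer class labels in {0, ..., K-1}
    A value close to 0 indicates strong collapse of features onto their class means.
    """
    classes = np.unique(labels)
    K, d = len(classes), features.shape[1]
    global_mean = features.mean(axis=0)

    sigma_w = np.zeros((d, d))  # average within-class covariance
    sigma_b = np.zeros((d, d))  # between-class covariance of class means
    for c in classes:
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        centered = fc - mu_c
        sigma_w += centered.T @ centered / len(fc)
        diff = (mu_c - global_mean)[:, None]
        sigma_b += diff @ diff.T
    sigma_w /= K
    sigma_b /= K

    return float(np.trace(sigma_w @ np.linalg.pinv(sigma_b)) / K)
```

Tracking this quantity across depths is one way to reproduce the trend reported in the abstract: deeper regularized networks should yield smaller NC1 values.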
Related papers
- Revisiting Deep Information Propagation: Fractal Frontier and Finite-size Effects [16.42026346710591]
We study information propagation in randomly initialized neural networks with finite width and reveal that the boundary between ordered and chaotic regimes exhibits a fractal structure. This shows the fundamental complexity of neural network dynamics, in a setting that is independent of input data and optimization.
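The ordered/chaotic boundary referenced above comes from the classical infinite-width signal-propagation analysis; the sketch below locates it numerically for tanh networks by checking where the slope of the correlation map, chi_1, crosses 1. This reproduces only the standard mean-field picture; the paper itself studies finite-width corrections to that boundary.

```python
import numpy as np

def fixed_point_q(sigma_w: float, sigma_b: float, iters: int = 200) -> float:
    """Iterate the length map q <- sigma_w^2 * E[tanh(sqrt(q) z)^2] + sigma_b^2."""
    z = np.random.default_rng(0).standard_normal(100_000)
    q = 1.0
    for _ in range(iters):
        q = sigma_w**2 * np.mean(np.tanh(np.sqrt(q) * z) ** 2) + sigma_b**2
    return q

def chi1(sigma_w: float, sigma_b: float) -> float:
    """Slope of the correlation map at c = 1; chi1 = 1 marks the order-chaos boundary."""
    z = np.random.default_rng(0).standard_normal(100_000)
    q = fixed_point_q(sigma_w, sigma_b)
    dphi = 1.0 - np.tanh(np.sqrt(q) * z) ** 2  # derivative of tanh
    return sigma_w**2 * np.mean(dphi**2)

# Scan sigma_w at fixed sigma_b: chi1 < 1 is the ordered phase, chi1 > 1 the chaotic one.
for sw in np.linspace(0.5, 3.0, 6):
    print(f"sigma_w={sw:.2f}  chi1={chi1(sw, 0.05):.3f}")
```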
arXiv Detail & Related papers (2025-08-05T08:49:24Z) - Optimal Depth of Neural Networks [2.1756081703276]
This paper introduces a formal theoretical framework for determining the optimal depth of a neural network. We model the layer-by-layer evolution of hidden representations as a sequential decision process. We propose a novel and practical regularization term, $\mathcal{L}_{\rm depth}$, that encourages the network to learn representations amenable to efficient early exiting.
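The summary does not give the exact form of $\mathcal{L}_{\rm depth}$. The sketch below is one hypothetical way to encourage early exiting: attach auxiliary classifiers at every depth and penalize a depth-weighted sum of their losses. The class, weighting scheme, and names are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitMLP(nn.Module):
    """Toy network with an auxiliary (early-exit) classifier after every hidden layer."""

    def __init__(self, dim_in: int, hidden: int, n_classes: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Linear(dim_in if i == 0 else hidden, hidden) for i in range(depth)]
        )
        self.exits = nn.ModuleList([nn.Linear(hidden, n_classes) for _ in range(depth)])

    def forward(self, x):
        logits_per_depth = []
        for block, exit_head in zip(self.blocks, self.exits):
            x = torch.relu(block(x))
            logits_per_depth.append(exit_head(x))
        return logits_per_depth

def depth_regularized_loss(logits_per_depth, targets, lam: float = 0.1):
    """Cross entropy at the final exit plus a depth-weighted penalty on earlier exits.

    Later exits are penalized more strongly, nudging the network to become accurate
    early -- a hypothetical stand-in for the paper's L_depth term.
    """
    main = F.cross_entropy(logits_per_depth[-1], targets)
    depth = len(logits_per_depth)
    reg = sum(
        (l + 1) / depth * F.cross_entropy(logits, targets)
        for l, logits in enumerate(logits_per_depth[:-1])
    )
    return main + lam * reg
```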
arXiv Detail & Related papers (2025-06-20T09:26:01Z) - Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks trained with stochastic gradient descent (SGD) in the tensor program framework. We show that SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
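Two of the claims above, linear independence of the learned features and their deviation from initialization, are easy to probe directly. The sketch below is a generic diagnostic under those assumptions, not the paper's evaluation code.

```python
import numpy as np

def feature_diagnostics(feats_trained: np.ndarray, feats_init: np.ndarray, tol: float = 1e-6):
    """Two simple checks motivated by the summary above.

    feats_*: (N, width) hidden representations of the same inputs before/after training.
    Returns the numerical rank of the trained features (linear independence) and the
    relative deviation from initialization (feature learning vs. lazy training).
    """
    s = np.linalg.svd(feats_trained, compute_uv=False)
    rank = int(np.sum(s > tol * s[0]))
    deviation = np.linalg.norm(feats_trained - feats_init) / np.linalg.norm(feats_init)
    return rank, deviation
```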
arXiv Detail & Related papers (2025-03-12T17:33:13Z) - Entropy-based Guidance of Deep Neural Networks for Accelerated Convergence and Improved Performance [0.8749675983608172]
We derive new mathematical results to measure the changes in entropy as fully-connected and convolutional neural networks process data.
By measuring how entropy changes as networks process data, patterns critical to a well-performing network can be visualized and identified.
Experiments in image compression, image classification, and image segmentation on benchmark datasets demonstrate that these entropy-based losses guide neural networks to learn rich latent data representations in fewer dimensions.
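A common way to make such entropy measurements concrete is a Gaussian differential-entropy proxy per layer; the sketch below uses the log-determinant of the activation covariance, which may differ from the estimator used in the paper.

```python
import numpy as np

def gaussian_entropy_proxy(activations: np.ndarray, eps: float = 1e-5) -> float:
    """Differential entropy of a Gaussian fit to layer activations (a common proxy).

    activations: (N, d) responses of one layer to a batch of inputs.
    H = 0.5 * log det(2*pi*e*Sigma), with a small ridge for numerical stability.
    Evaluating this layer by layer tracks how entropy changes as data flows
    through the network, the kind of measurement the summary above describes.
    """
    n, d = activations.shape
    centered = activations - activations.mean(axis=0)
    sigma = centered.T @ centered / n + eps * np.eye(d)
    _, logdet = np.linalg.slogdet(2 * np.pi * np.e * sigma)
    return 0.5 * logdet
```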
arXiv Detail & Related papers (2023-08-28T23:33:07Z) - Addressing caveats of neural persistence with deep graph persistence [54.424983583720675]
We find that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence.
We propose an extension of the filtration underlying neural persistence to the whole neural network instead of single layers.
This yields our deep graph persistence measure, which implicitly incorporates persistent paths through the network and alleviates variance-related issues.
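The sketch below illustrates the 0-dimensional persistence computation that measures like neural persistence build on: edges of a layer's bipartite weight graph are added from strongest to weakest, and each merge of connected components contributes a persistence value. It is a simplified single-layer illustration, not the authors' deep-graph-persistence implementation.

```python
import numpy as np

def neural_persistence_layer(W: np.ndarray, p: float = 2.0) -> float:
    """0-dim persistence of a layer's bipartite weight graph (cf. neural persistence).

    Vertices are the layer's input and output units; edges carry normalized absolute
    weights. Adding edges from strongest to weakest, each merge of two connected
    components contributes a persistence of (1 - w_edge); the result is the p-norm
    of these contributions. A simplified sketch, not the reference implementation.
    """
    n_in, n_out = W.shape
    w = np.abs(W) / (np.abs(W).max() + 1e-12)  # normalize weights to [0, 1]

    parent = list(range(n_in + n_out))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Sort edges by normalized weight, strongest first.
    edges = [(w[i, j], i, n_in + j) for i in range(n_in) for j in range(n_out)]
    edges.sort(reverse=True)

    persistences = []
    for weight, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                      # this edge merges two components
            parent[ru] = rv
            persistences.append(1.0 - weight)
    return float(np.sum(np.array(persistences) ** p) ** (1.0 / p))
```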
arXiv Detail & Related papers (2023-07-20T13:34:11Z) - Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data [12.225207401994737]
We show that complex systems with massive amounts of parameters exhibit the same structural properties when trained until convergence.
In particular, it has been observed that the last-layer features collapse to their class-means.
Our results demonstrate the convergence of the last-layer features and classifiers to a shared geometric structure.
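For reference, in the balanced setting the collapsed geometry is usually the simplex equiangular tight frame (ETF); a standard way to write it is below. The notation is generic, and imbalanced data can deform this structure, which is part of what the paper analyzes.

```latex
% Simplex ETF: with unit-normalized class-mean directions m_1, ..., m_K,
% neural collapse in the balanced case predicts the pairwise inner products
\[
  \langle m_i, m_j \rangle =
  \begin{cases}
    1, & i = j,\\
    -\dfrac{1}{K-1}, & i \neq j,
  \end{cases}
  \qquad\text{equivalently}\qquad
  M^\top M \;\propto\; I_K - \tfrac{1}{K}\mathbf{1}_K\mathbf{1}_K^\top ,
\]
% where M = [m_1, ..., m_K] stacks the class-mean directions.
```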
arXiv Detail & Related papers (2023-01-01T16:29:56Z) - Towards Comprehensive Monocular Depth Estimation: Multiple Heads Are Better Than One [32.01675089157679]
We propose to integrate the strengths of multiple weak depth predictors to build a comprehensive and accurate depth predictor.
Specifically, we construct multiple base (weak) depth predictors by utilizing different Transformer-based and convolutional neural network (CNN)-based architectures.
The resultant model, which we refer to as Transformer-assisted depth ensembles (TEDepth), achieves better results than previous state-of-the-art approaches.
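As a minimal illustration of the ensembling idea, the sketch below fuses per-pixel predictions from several base depth models by a weighted average; TEDepth's actual combination scheme may be more elaborate.

```python
import numpy as np

def ensemble_depth(depth_maps, weights=None) -> np.ndarray:
    """Fuse per-pixel depth maps from several base predictors by a weighted average.

    depth_maps: list of (H, W) predictions, e.g. from Transformer- and CNN-based models.
    weights:    optional per-model weights; defaults to a uniform average.
    A deliberately simple fusion rule used only to illustrate the ensemble idea.
    """
    stack = np.stack(depth_maps, axis=0)                       # (M, H, W)
    if weights is None:
        weights = np.full(len(depth_maps), 1.0 / len(depth_maps))
    weights = np.asarray(weights).reshape(-1, 1, 1)
    return (weights * stack).sum(axis=0)
```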
arXiv Detail & Related papers (2021-11-16T09:09:05Z) - Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z) - Statistical Mechanics of Deep Linear Neural Networks: The Back-Propagating Renormalization Group [4.56877715768796]
We study the statistical mechanics of learning in Deep Linear Neural Networks (DLNNs) in which the input-output function of an individual unit is linear.
We exactly solve for the network properties after supervised learning using an equilibrium Gibbs distribution in weight space.
Our numerical simulations reveal that despite the nonlinearity, the predictions of our theory are largely shared by ReLU networks with modest depth.
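The equilibrium analysis referred to above rests on a Gibbs distribution over weights. A generic form is written below; the notation follows the standard statistical-mechanics-of-learning setup rather than the paper's exact conventions.

```latex
% Equilibrium Gibbs distribution over weights W at inverse temperature beta,
% with training loss E(W) and an L2 (Gaussian prior / weight decay) term:
\[
  P(W) \;=\; \frac{1}{Z}\,
  \exp\!\Big(-\beta\, E(W) \;-\; \tfrac{\lambda}{2}\,\lVert W\rVert_2^2\Big),
  \qquad
  Z \;=\; \int \exp\!\Big(-\beta\, E(W) - \tfrac{\lambda}{2}\lVert W\rVert_2^2\Big)\, dW .
\]
```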
arXiv Detail & Related papers (2020-12-07T20:08:31Z) - On Robustness and Transferability of Convolutional Neural Networks [147.71743081671508]
Modern deep convolutional neural networks (CNNs) are often criticized for not generalizing under distributional shifts.
We study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time.
We find that increasing both the training set and model sizes significantly improves distributional shift robustness.
arXiv Detail & Related papers (2020-07-16T18:39:04Z) - Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (ResNet) architectures.
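For orientation, the classical two-layer mean-field representation that such frameworks generalize is written below; the deep formulation in the paper instead represents every layer by a measure over its features. This rendition is the standard textbook form, not the paper's notation.

```latex
% Two-layer mean-field limit: the network is a probability measure rho over neuron
% parameters (w, a) rather than a finite weight vector:
\[
  f_\rho(x) \;=\; \int a\,\sigma\!\big(\langle w, x\rangle\big)\, d\rho(w, a),
\]
% and (stochastic) gradient descent on the risk corresponds, in the continuous
% limit, to a Wasserstein gradient flow over the measure rho.
```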
arXiv Detail & Related papers (2020-07-03T01:37:16Z) - Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)