Generalization performance of narrow one-hidden layer networks in the teacher-student setting
- URL: http://arxiv.org/abs/2507.00629v2
- Date: Wed, 02 Jul 2025 15:49:53 GMT
- Title: Generalization performance of narrow one-hidden layer networks in the teacher-student setting
- Authors: Jean Barbier, Federica Gerace, Alessandro Ingrosso, Clarissa Lauditi, Enrico M. Malatesta, Gibbs Nwemadji, Rodrigo Pérez Ortiz
- Abstract summary: We develop a general theory for narrow networks, i.e. networks whose number of hidden units is large, yet much smaller than the input dimension. Our theory accurately predicts the generalization error of neural networks trained on regression or classification tasks.
- Score: 40.69556943879117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the generalization abilities of neural networks for simple input-output distributions is crucial to account for their learning performance on real datasets. The classical teacher-student setting, where a network is trained on data generated by a label-generating teacher model, serves as a perfect theoretical test bed. In this context, a complete theoretical account of the performance of fully connected one-hidden layer networks in the presence of generic activation functions is lacking. In this work, we develop such a general theory for narrow networks, i.e. networks whose number of hidden units is large, yet much smaller than the input dimension. Using methods from statistical physics, we provide closed-form expressions for the typical performance of both finite-temperature (Bayesian) and empirical risk minimization estimators, in terms of a small number of weight statistics. In doing so, we highlight the presence of a transition where hidden neurons specialize when the number of samples is sufficiently large and proportional to the number of parameters of the network. Our theory accurately predicts the generalization error of neural networks trained on regression or classification tasks with either noisy full-batch gradient descent (Langevin dynamics) or full-batch gradient descent.
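To make the setting concrete, below is a minimal NumPy sketch of the teacher-student setup and of the noisy full-batch gradient descent (Langevin dynamics) training the abstract refers to. This is not the authors' code: the dimensions, the tanh activation, the learning rate and the temperature are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: input dimension d much larger than the K hidden units (narrow regime),
# sample size n of the order of the number of parameters (~ K * d).
d, K, n = 500, 4, 2500
act = np.tanh  # illustrative choice; the gradient below uses tanh'(z) = 1 - tanh(z)**2

# Teacher: a fixed one-hidden-layer network that generates the labels.
W_teacher = rng.standard_normal((K, d)) / np.sqrt(d)
a_teacher = rng.standard_normal(K) / np.sqrt(K)
X = rng.standard_normal((n, d))
y = act(X @ W_teacher.T) @ a_teacher        # regression targets; wrap in np.sign() for classification

# Student: same architecture, trained by noisy full-batch gradient descent (Langevin dynamics).
W = rng.standard_normal((K, d)) / np.sqrt(d)
a = rng.standard_normal(K) / np.sqrt(K)
lr, temperature, steps = 0.05, 1e-3, 1000

for _ in range(steps):
    h = act(X @ W.T)                        # (n, K) hidden activations
    err = h @ a - y                         # residuals of the half mean-squared error
    grad_a = h.T @ err / n
    grad_W = ((err[:, None] * a) * (1.0 - h**2)).T @ X / n
    W += -lr * grad_W + np.sqrt(2 * lr * temperature) * rng.standard_normal(W.shape)
    a += -lr * grad_a + np.sqrt(2 * lr * temperature) * rng.standard_normal(a.shape)

# Generalization error estimated on fresh inputs labelled by the same teacher.
X_new = rng.standard_normal((2000, d))
y_new = act(X_new @ W_teacher.T) @ a_teacher
print("test MSE:", np.mean((act(X_new @ W.T) @ a - y_new) ** 2))
```

Setting the temperature to zero removes the injected noise and recovers plain full-batch gradient descent, the other training protocol the abstract mentions.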
Related papers
- Statistical mechanics of extensive-width Bayesian neural networks near interpolation [4.976898227858662]
We study the supervised learning of a two-layer fully connected network with generic weight distribution and activation function. We focus on Bayes-optimal learning in the teacher-student scenario, i.e., with a dataset generated by another network with the same architecture. Our analysis uncovers a rich phenomenology with various learning transitions as the number of data increases.
arXiv Detail & Related papers (2025-05-30T17:46:59Z) - Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks using the tensor programs framework. We show that stochastic gradient descent (SGD) enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv Detail & Related papers (2025-03-12T17:33:13Z) - Diffused Redundancy in Pre-trained Representations [98.55546694886819]
We take a closer look at how features are encoded in pre-trained representations.
We find that learned representations in a given layer exhibit a degree of diffuse redundancy.
Our findings shed light on the nature of representations learned by pre-trained deep neural networks.
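Diffuse redundancy of the kind described above is typically probed by fitting a linear readout on random subsets of a layer's neurons and checking how slowly accuracy degrades as the subset shrinks. A hedged NumPy sketch of that diagnostic follows; the activation matrix `H` is a random stand-in for real pre-trained features, and the ridge probe and subset fractions are illustrative choices, not the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for activations of one layer of a pre-trained network:
# H has shape (n_samples, n_neurons); y are downstream labels.
# In practice H would be extracted from the real model with a forward hook.
n, width = 1000, 512
H = np.maximum(rng.standard_normal((n, width)) @ rng.standard_normal((width, width)), 0)
y = (rng.standard_normal(width) @ H.T > 0).astype(float)

def probe_accuracy(features, labels, reg=1e-2):
    """Fit a ridge-regression probe on the given neurons and report its accuracy."""
    w = np.linalg.solve(features.T @ features + reg * np.eye(features.shape[1]),
                        features.T @ (2 * labels - 1))
    return np.mean((features @ w > 0) == labels.astype(bool))

# If the representation is diffusely redundant, accuracy should degrade slowly
# as the random fraction of neurons kept for the probe shrinks.
for frac in (1.0, 0.5, 0.25, 0.1):
    idx = rng.choice(width, size=int(frac * width), replace=False)
    print(f"probe on {frac:.0%} of neurons: acc = {probe_accuracy(H[:, idx], y):.3f}")
```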
arXiv Detail & Related papers (2023-05-31T21:00:50Z) - Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
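A common way to test for this kind of bias toward lower-order statistics is to build "Gaussian clones" of the data that preserve only the class means and covariances: if a network's early-training behaviour is the same on the clone as on the real data, it is relying only on those lower-order statistics. A minimal, hedged sketch of the clone construction follows (the toy data and labels are placeholders, and this is not the paper's exact experiment).

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_clone(X, y, rng):
    """Replace each class by Gaussian samples with the same mean and covariance.

    A classifier that uses only first- and second-order input statistics should
    behave the same on the clone as on the original data, at least early in training.
    """
    X_clone = np.empty_like(X, dtype=float)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        mu = X[idx].mean(axis=0)
        cov = np.cov(X[idx], rowvar=False)
        X_clone[idx] = rng.multivariate_normal(mu, cov, size=len(idx))
    return X_clone

# Toy data standing in for a real dataset.
X = rng.standard_normal((600, 20))
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 > 1).astype(int)
X_gauss = gaussian_clone(X, y, rng)
print(X_gauss.shape)
```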
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - On the optimization and generalization of overparameterized implicit neural networks [25.237054775800164]
Implicit neural networks have become increasingly attractive in the machine learning community.
We show that global convergence is guaranteed, even if only the implicit layer is trained.
This paper investigates the generalization error for implicit neural networks.
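An implicit layer defines its output not by a fixed stack of operations but as the solution z* of a fixed-point equation such as z* = tanh(W z* + U x + b). The hedged NumPy sketch below shows a forward pass solved by plain iteration; the sizes, the tanh nonlinearity, and the spectral rescaling used to keep the map contractive are illustrative assumptions, not the construction analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
d, h = 16, 32

# Implicit-layer parameters: the output is the fixed point z* = tanh(W z* + U x + b).
W = rng.standard_normal((h, h))
W = 0.5 * W / np.linalg.norm(W, 2)      # spectral norm 0.5 keeps the iteration a contraction
U = rng.standard_normal((h, d)) / np.sqrt(d)
b = np.zeros(h)

def implicit_forward(x, n_iter=200, tol=1e-10):
    """Solve the fixed-point equation by simple forward iteration."""
    z = np.zeros(h)
    for _ in range(n_iter):
        z_new = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z_new

x = rng.standard_normal(d)
z_star = implicit_forward(x)
print("fixed point reached:", np.allclose(z_star, np.tanh(W @ z_star + U @ x + b), atol=1e-8))
```

In practice the backward pass of such a layer is obtained by differentiating through the fixed-point condition rather than through the iterations, which is what makes it meaningful to train only the implicit layer, as the summary above mentions.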
arXiv Detail & Related papers (2022-09-30T16:19:46Z) - With Greater Distance Comes Worse Performance: On the Perspective of Layer Utilization and Model Generalization [3.6321778403619285]
Generalization of deep neural networks remains one of the main open problems in machine learning.
Early layers generally learn representations relevant to performance on both training data and testing data.
Deeper layers only minimize training risks and fail to generalize well with testing or mislabeled data.
arXiv Detail & Related papers (2022-01-28T05:26:32Z) - Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity
on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by analyzing the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
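The result above concerns the geometry of the objective around pruned networks; as a point of reference, here is a hedged sketch of the magnitude pruning operation itself (keep the largest-magnitude weights, zero the rest), which is the standard way such pruned "lottery ticket" subnetworks are obtained. The layer size and sparsity level are illustrative.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights, keeping a (1 - sparsity) fraction."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(5)
W = rng.standard_normal((128, 128))
W_pruned, mask = magnitude_prune(W, sparsity=0.8)
print("kept fraction:", mask.mean())   # ~0.2; the surviving weights are then (re)trained
```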
arXiv Detail & Related papers (2021-10-12T01:11:07Z) - Slope and generalization properties of neural networks [0.0]
We show that the distribution of the slope of a well-trained neural network classifier is generally independent of the width of the layers in a fully connected network.
The slope is of similar size throughout the relevant volume, and varies smoothly. It also behaves as predicted in rescaling examples.
We discuss possible applications of the slope concept, such as using it as a part of the loss function or stopping criterion during network training, or ranking data sets in terms of their complexity.
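One natural way to compute a "slope" of the function a network implements is the norm of its input gradient; the sketch below evaluates that quantity for a toy one-hidden-layer tanh network and samples its distribution over inputs. It illustrates the general idea only and is not necessarily the exact definition used in the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
d, width = 10, 50

W1 = rng.standard_normal((width, d)) / np.sqrt(d)
w2 = rng.standard_normal(width) / np.sqrt(width)

def f(x):
    """Toy one-hidden-layer network."""
    return np.tanh(W1 @ x) @ w2

def slope(x):
    """Norm of the input gradient of the network output at x."""
    h = np.tanh(W1 @ x)
    grad = W1.T @ (w2 * (1 - h**2))    # d f / d x for a tanh network
    return np.linalg.norm(grad)

# Sample the slope over many inputs to inspect its distribution.
xs = rng.standard_normal((1000, d))
slopes = np.array([slope(x) for x in xs])
print("slope mean/std:", slopes.mean(), slopes.std())
```

A per-input quantity like this could, as the summary suggests, be averaged into a term of the loss function, monitored as a stopping criterion, or used to rank datasets by complexity.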
arXiv Detail & Related papers (2021-07-03T17:54:27Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - Perceptron Theory Can Predict the Accuracy of Neural Networks [6.136302173351179]
Multilayer neural networks set the current state of the art for many technical classification problems.
However, these networks remain essentially black boxes when it comes to analyzing them and predicting their performance.
Here, we develop a statistical theory for the one-layer perceptron and show that it can predict the performance of a surprisingly large variety of neural networks.
arXiv Detail & Related papers (2020-12-14T19:02:26Z) - Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena [17.205106391379026]
In the context of neural network models, overparametrization refers to the phenomenon whereby these models appear to generalize well on unseen data.
A conventional explanation of this phenomenon is based on the self-regularization properties of the algorithms used for training.
We show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by the data dimension.
arXiv Detail & Related papers (2020-03-23T20:09:31Z)