Simplicity bias and optimization threshold in two-layer ReLU networks
- URL: http://arxiv.org/abs/2410.02348v1
- Date: Thu, 3 Oct 2024 09:58:57 GMT
- Title: Simplicity bias and optimization threshold in two-layer ReLU networks
- Authors: Etienne Boursier, Nicolas Flammarion
- Abstract summary: We show that despite overparametrization, networks converge toward simpler solutions rather than interpolating the training data.
Our analysis relies on the so-called early alignment phase, during which neurons align towards specific directions.
- Score: 24.43739371803548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the generalization of overparametrized neural networks remains a fundamental challenge in machine learning. Most of the literature studies generalization from an interpolation point of view, taking convergence of parameters towards a global minimum of the training loss for granted. While overparametrized architectures indeed interpolate the data for typical classification tasks, this interpolation paradigm no longer seems valid for more complex tasks such as in-context learning or diffusion. Instead, for such tasks, it has been empirically observed that trained models go from global minima to spurious local minima of the training loss as the number of training samples exceeds some level we call the optimization threshold. While the former yield poor generalization to the true population loss, the latter were observed to actually correspond to the minimizer of this true loss. This paper explores this phenomenon theoretically in the context of two-layer ReLU networks. We demonstrate that, despite overparametrization, networks often converge toward simpler solutions rather than interpolating the training data, which can lead to a drastic improvement of the test loss with respect to interpolating solutions. Our analysis relies on the so-called early alignment phase, during which neurons align towards specific directions. This directional alignment, which occurs in the early stage of training, leads to a simplicity bias, wherein the network approximates the ground truth model without converging to the global minimum of the training loss. Our results suggest that this bias, resulting in an optimization threshold beyond which interpolation is no longer reached, is beneficial and enhances the generalization of trained models.
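The early alignment phase described in the abstract can be illustrated numerically. Below is a minimal sketch, not the paper's exact setting: the single-ReLU teacher, the dimensions, the step size, and the initialisation scale are all illustrative assumptions. It trains a two-layer ReLU network f(x) = sum_j a_j ReLU(<w_j, x>) with plain gradient descent from a small initialisation and prints how strongly the hidden-neuron directions w_j / ||w_j|| align with the teacher direction.

```python
import numpy as np

# Minimal sketch of the early alignment phase (illustrative setup, not the
# paper's): a two-layer ReLU network trained from a small initialisation,
# tracking the alignment of hidden-neuron directions with a teacher direction.

rng = np.random.default_rng(0)
d, m, n = 2, 50, 200              # input dim, hidden width, number of samples

# Hypothetical teacher: a single ReLU neuron in direction u.
u = np.array([1.0, 0.0])
X = rng.standard_normal((n, d))
y = np.maximum(X @ u, 0.0)

# A small initialisation scale places training in the feature-learning regime.
scale = 1e-4
W = scale * rng.standard_normal((m, d))   # hidden weights, one row per neuron
a = scale * rng.standard_normal(m)        # output weights

lr = 0.5
for step in range(5001):
    pre = X @ W.T                  # (n, m) pre-activations
    act = np.maximum(pre, 0.0)     # ReLU activations
    resid = act @ a - y            # (n,) residuals
    grad_a = act.T @ resid / n
    grad_W = ((resid[:, None] * (pre > 0)) * a).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W
    if step % 1000 == 0:
        dirs = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
        cos = dirs @ u             # cosine similarity with teacher direction
        print(f"step {step:5d}  loss {0.5 * np.mean(resid**2):.2e}  "
              f"max |cos(w_j, u)| {np.max(np.abs(cos)):.3f}")
```

In runs of this sketch, one expects the printed cosine similarities to approach 1 early in training, before the loss is fully driven down: the directional alignment that the abstract refers to, which biases the network towards a simple, low-neuron-count solution rather than an interpolating one.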
Related papers
- Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z) - Understanding the Generalization Benefits of Late Learning Rate Decay [14.471831651042367]
We show the relation between training and testing loss in neural networks.
We introduce a nonlinear model whose loss landscapes mirror those observed for real neural networks.
We demonstrate that an extended phase with a large learning rate steers our model towards the minimum norm solution of the training loss.
arXiv Detail & Related papers (2024-01-21T21:11:09Z) - Early alignment in two-layer networks training is a two-edged sword [24.43739371803548]
Training neural networks with first-order optimisation methods is at the core of the empirical success of deep learning.
Small initialisations are generally associated with a feature learning regime, for which gradient descent is implicitly biased towards simple solutions.
This work provides a general and quantitative description of the early alignment phase, originally introduced by Maennel et al.
arXiv Detail & Related papers (2024-01-19T16:23:53Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via Polyak-Łojasiewicz, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex, and the weight decay regularization is employed, any optimization algorithm including Adam will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - What training reveals about neural network complexity [80.87515604428346]
This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced by how fast its weights change during training.
Our results support the hypothesis that good training behavior can be a useful bias towards good generalization.
arXiv Detail & Related papers (2021-06-08T08:58:00Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - On the Explicit Role of Initialization on the Convergence and Implicit
Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z) - Analyzing Overfitting under Class Imbalance in Neural Networks for Image
Segmentation [19.259574003403998]
In image segmentation, neural networks may overfit to the foreground samples from small structures.
In this study, we provide new insights on the problem of overfitting under class imbalance by inspecting the network behavior.
arXiv Detail & Related papers (2021-02-20T14:57:58Z) - Learning the Travelling Salesperson Problem Requires Rethinking
Generalization [9.176056742068813]
End-to-end training of neural network solvers for graph optimization problems such as the Travelling Salesperson Problem (TSP) has seen a surge of interest recently.
While state-of-the-art learning-driven approaches perform closely to classical solvers when trained on trivially small sizes, they are unable to generalize the learnt policy to larger instances at practical scales.
This work presents an end-to-end neural optimization pipeline that unifies several recent papers in order to identify the principled biases, model architectures and learning algorithms that promote generalization to instances larger than those seen in training.
arXiv Detail & Related papers (2020-06-12T10:14:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.