Related papers: Early alignment in two-layer networks training is a two-edged sword

Early alignment in two-layer networks training is a two-edged sword

URL: http://arxiv.org/abs/2401.10791v1
Date: Fri, 19 Jan 2024 16:23:53 GMT
Title: Early alignment in two-layer networks training is a two-edged sword
Authors: Etienne Boursier, Nicolas Flammarion
Abstract summary: Training neural networks with first order optimisation methods is at the core of the empirical success of deep learning. Small initialisations are generally associated to a feature learning regime, for which gradient descent is implicitly biased towards simple solutions. This work provides a general and quantitative description of the early alignment phase, originally introduced by Maennel et al.
Score: 28.86954341732928
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training neural networks with first order optimisation methods is at the core of the empirical success of deep learning. The scale of initialisation is a crucial factor, as small initialisations are generally associated to a feature learning regime, for which gradient descent is implicitly biased towards simple solutions. This work provides a general and quantitative description of the early alignment phase, originally introduced by Maennel et al. (2018) . For small initialisation and one hidden ReLU layer networks, the early stage of the training dynamics leads to an alignment of the neurons towards key directions. This alignment induces a sparse representation of the network, which is directly related to the implicit bias of gradient flow at convergence. This sparsity inducing alignment however comes at the expense of difficulties in minimising the training objective: we also provide a simple data example for which overparameterised networks fail to converge towards global minima and only converge to a spurious stationary point instead.

Related papers

Simplicity bias and optimization threshold in two-layer ReLU networks [24.43739371803548]
We show that despite overparametrization, networks converge toward simpler solutions rather than interpolating the training data. Our analysis relies on the so called early alignment phase, during which neurons align towards specific directions.
arXiv Detail & Related papers (2024-10-03T09:58:57Z)
Early Directional Convergence in Deep Homogeneous Neural Networks for Small Initializations [2.310288676109785]
This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks. The weights of the neural network remain small in norm and approximately converge in direction along the Karush-Kuhn-Tucker points.
arXiv Detail & Related papers (2024-03-12T23:17:32Z)
Unsupervised Learning of Initialization in Deep Neural Networks via Maximum Mean Discrepancy [74.34895342081407]
We propose an unsupervised algorithm to find good initialization for input data. We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification. We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters.
arXiv Detail & Related papers (2023-02-08T23:23:28Z)
Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with ReLU activations. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that leakyally, gradient flow produces a neural network with rank at most two. For gradient descent, provided the random variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
Global Convergence Analysis of Deep Linear Networks with A One-neuron Layer [18.06634056613645]
We consider optimizing deep linear networks which have a layer with one neuron under quadratic loss. We describe the convergent point of trajectories with arbitrary starting point under flow. We show specific convergence rates of trajectories that converge to the global gradientr by stages.
arXiv Detail & Related papers (2022-01-08T04:44:59Z)
Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies. We achieve the desiderata viaak-Lojasiewicz, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z)
On the Explicit Role of Initialization on the Convergence and Implicit Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow. We show that the squared loss converges exponentially to its optimum. We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z)
Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-perfect solutions to non-optimal training problems. We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z)
On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent [55.96478231566129]
We show that relative scales play an important role in determining the learned model. We develop a technique for deriving the inductive bias of gradient-flow.
arXiv Detail & Related papers (2021-02-19T07:10:48Z)
MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks. The use of gradient combined nonvolutionity renders learning susceptible to novel problems. We propose fusing neighboring layers of deeper networks that are trained with random variables.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.