On Sparsity in Overparametrised Shallow ReLU Networks
- URL: http://arxiv.org/abs/2006.10225v1
- Date: Thu, 18 Jun 2020 01:35:26 GMT
- Title: On Sparsity in Overparametrised Shallow ReLU Networks
- Authors: Jaume de Dios and Joan Bruna
- Abstract summary: We study the ability of different regularisation strategies to capture solutions requiring only a finite number of neurons, even in the infinitely wide regime.
We establish that both schemes are minimised by functions having only a finite number of neurons, irrespective of the amount of overparametrisation.
- Score: 42.33056643582297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The analysis of neural network training beyond the linearization regime
remains an outstanding open question, even in the simplest setup of a single
hidden-layer. The limit of infinitely wide networks provides an appealing route
forward through the mean-field perspective, but a key challenge is to bring
learning guarantees back to the finite-neuron setting, where practical
algorithms operate.
Towards closing this gap, and focusing on shallow neural networks, in this
work we study the ability of different regularisation strategies to capture
solutions requiring only a finite number of neurons, even in the infinitely
wide regime. Specifically, we consider (i) a form of implicit regularisation
obtained by injecting noise into training targets [Blanc et al., 2019], and (ii)
the variation-norm regularisation [Bach, 2017], compatible with the mean-field
scaling. Under mild assumptions on the activation function (satisfied for
instance with ReLUs), we establish that both schemes are minimised by functions
having only a finite number of neurons, irrespective of the amount of
overparametrisation. We study the consequences of this property and describe
the settings in which one form of regularisation is preferable to the other.
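For intuition, the sketch below (not the authors' code) contrasts the two schemes on a finite-width shallow ReLU network: injecting Gaussian noise into the training targets, as in Blanc et al. (2019), versus adding a weight-decay penalty that serves as a finite-width proxy for the variation norm of Bach (2017). The width, learning rate, noise level and penalty coefficient are illustrative assumptions, not values from the paper.
```python
# Minimal sketch (assumed hyper-parameters, not the authors' implementation)
# of the two regularisation schemes for a single-hidden-layer ReLU network.
import torch
import torch.nn as nn

torch.manual_seed(0)
width, lam, sigma = 1024, 1e-3, 0.1          # illustrative choices

net = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

def variation_norm_proxy(model):
    # Squared-weight penalty over both layers: by AM-GM it upper-bounds
    # (twice) the path norm sum_j |a_j| * ||w_j||, a common finite-width
    # surrogate for the variation norm of Bach (2017).
    return sum((p ** 2).sum() for p in model.parameters())

def train_step(x, y, scheme="variation_norm"):
    opt.zero_grad()
    if scheme == "label_noise":
        # Implicit regularisation of Blanc et al. (2019): perturb the targets.
        y = y + sigma * torch.randn_like(y)
    loss = ((net(x) - y) ** 2).mean()
    if scheme == "variation_norm":
        loss = loss + lam * variation_norm_proxy(net)
    loss.backward()
    opt.step()
    return loss.item()

# Toy 1-D regression data, purely for illustration.
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = torch.sin(3.0 * x)
for _ in range(200):
    train_step(x, y, scheme="variation_norm")
```
Either branch is only a finite-width caricature of the schemes analysed in the paper's infinite-width limit.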
Related papers
- Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - A Functional-Space Mean-Field Theory of Partially-Trained Three-Layer
Neural Networks [49.870593940818715]
We study the infinite-width limit of a type of three-layer NN model whose first layer is random and fixed.
Our theory accommodates different scaling choices of the model, resulting in two regimes of the MF limit that demonstrate distinctive behaviors.
arXiv Detail & Related papers (2022-10-28T17:26:27Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Global Convergence Analysis of Deep Linear Networks with A One-neuron
Layer [18.06634056613645]
We consider optimizing deep linear networks which have a layer with one neuron under quadratic loss.
We describe the convergence point of trajectories with arbitrary starting points under gradient flow.
We show specific convergence rates of trajectories that converge to the global minimizer by stages.
arXiv Detail & Related papers (2022-01-08T04:44:59Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Training Integrable Parameterizations of Deep Neural Networks in the
Infinite-Width Limit [0.0]
Large-width dynamics has emerged as a fruitful viewpoint and led to practical insights on real-world deep networks.
For two-layer neural networks, it has been understood that the nature of the trained model radically changes depending on the scale of the initial random weights.
We propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics.
arXiv Detail & Related papers (2021-10-29T07:53:35Z) - On the Explicit Role of Initialization on the Convergence and Implicit
Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z) - Over-parametrized neural networks as under-determined linear systems [31.69089186688224]
We show that it is unsurprising that simple neural networks can achieve zero training loss.
We show that kernels typically associated with the ReLU activation function have fundamental flaws.
We propose new activation functions that avoid the pitfalls of ReLU in that they admit zero training loss solutions for any set of distinct data points.
arXiv Detail & Related papers (2020-10-29T21:43:00Z) - Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z) - Deep Neural Networks with Trainable Activations and Controlled Lipschitz
Constant [26.22495169129119]
We introduce a variational framework to learn the activation functions of deep neural networks.
Our aim is to increase the capacity of the network while controlling an upper-bound of the Lipschitz constant.
We numerically compare our scheme with the standard ReLU network and its variations, PReLU and LeakyReLU.
arXiv Detail & Related papers (2020-01-17T12:32:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.