Subquadratic Overparameterization for Shallow Neural Networks
- URL: http://arxiv.org/abs/2111.01875v1
- Date: Tue, 2 Nov 2021 20:24:01 GMT
- Title: Subquadratic Overparameterization for Shallow Neural Networks
- Authors: Chaehwan Song, Ali Ramezani-Kebrya, Thomas Pethick, Armin Eftekhari,
Volkan Cevher
- Abstract summary: We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via the Polyak-Lojasiewicz condition, smoothness, and standard assumptions.
- Score: 60.721751363271146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overparameterization refers to the important phenomenon where the width of a
neural network is chosen such that learning algorithms can provably attain zero
loss in nonconvex training. The existing theory establishes such global
convergence using various initialization strategies, training modifications,
and width scalings. In particular, the state-of-the-art results require the
width to scale quadratically with the number of training data under standard
initialization strategies used in practice for best generalization performance.
In contrast, the most recent results obtain linear scaling either by requiring
initializations that lead to "lazy training" or by training only a
single layer. In this work, we provide an analytical framework that allows us
to adopt standard initialization strategies, possibly avoid lazy training, and
train all layers simultaneously in basic shallow neural networks while
attaining a desirable subquadratic scaling on the network width. We achieve the
desiderata via the Polyak-Lojasiewicz condition, smoothness, and standard
assumptions on data, and use tools from random matrix theory.
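For reference, a standard statement of the Polyak-Lojasiewicz (PL) condition, in our own notation (the loss L, its infimum L^*, and the constant mu below are illustrative symbols, not taken verbatim from the paper):

```latex
% PL condition: the squared gradient norm dominates the suboptimality gap.
\frac{1}{2}\,\lVert \nabla L(\theta) \rVert^2 \;\ge\; \mu \,\bigl( L(\theta) - L^* \bigr)
\quad \text{for all } \theta, \text{ with some } \mu > 0.
% Combined with \beta-smoothness, gradient descent with step size 1/\beta satisfies
% L(\theta_t) - L^* \le (1 - \mu/\beta)^t \, (L(\theta_0) - L^*).
```

Together with smoothness, this yields linear convergence of gradient descent without convexity, which is the mechanism behind the global convergence claim above.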
Related papers
- Concurrent Training and Layer Pruning of Deep Neural Networks [0.0]
We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training.
We employ residual connections around nonlinear network sections so that information keeps flowing through the network once a nonlinear section is pruned (see the sketch below).
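A minimal PyTorch sketch of that residual wrapper, assuming a simple two-layer nonlinear section; the class name and the boolean pruning flag are our own illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class PrunableResidualSection(nn.Module):
    """A nonlinear section wrapped in a residual connection.

    When the section is pruned, the identity path still carries the
    input forward, so downstream layers keep receiving information.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.pruned = False  # flipped when the section is deemed irrelevant

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pruned:
            return x               # identity only: the section is gone
        return x + self.branch(x)  # residual: identity plus nonlinear branch

# Pruning the middle of three sections leaves a working network.
net = nn.Sequential(*[PrunableResidualSection(16) for _ in range(3)])
net[1].pruned = True
out = net(torch.randn(4, 16))  # shape (4, 16)
```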
arXiv Detail & Related papers (2024-06-06T23:19:57Z)
- Improving Generalization of Deep Neural Networks by Optimum Shifting [33.092571599896814]
We propose a novel method called "optimum shifting", which changes the parameters of a neural network from a sharp minimum to a flatter one.
Our method is based on the observation that when the input and output of a neural network are fixed, the matrix multiplications within the network can be treated as systems of under-determined linear equations (see the sketch below).
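A toy numpy illustration of that observation; the shapes and the use of a minimum-norm least-squares solve are our choices for the sketch, not the paper's procedure. With the layer input X and output Y held fixed, W has more unknowns than W @ X = Y imposes equations, so many weight matrices realize the same input-output map:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 8, 4, 3            # 32 unknowns in W, only 4*3 = 12 equations
X = rng.standard_normal((d_in, n))  # fixed layer input (one column per sample)
W = rng.standard_normal((d_out, d_in))
Y = W @ X                           # fixed layer output to preserve

# Solve the under-determined system W_new @ X = Y; transposing gives the
# standard least-squares form X.T @ W_new.T = Y.T, and lstsq returns the
# minimum-norm solution for an under-determined consistent system.
W_new = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

print(np.allclose(W_new @ X, Y))                   # True: same layer output
print(np.linalg.norm(W_new) <= np.linalg.norm(W))  # True: smaller-norm weights
```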
arXiv Detail & Related papers (2024-05-23T02:31:55Z)
- The Convex Landscape of Neural Networks: Characterizing Global Optima and Stationary Points via Lasso Models [75.33431791218302]
Deep Neural Network (DNN) models are trained in practice with non-convex optimization heuristics.
In this paper we examine the use of convex neural recovery models.
We show that all stationary points of the non-convex objective can be characterized as the global optimum of a subsampled convex program.
arXiv Detail & Related papers (2023-12-19T23:04:56Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
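For concreteness, the threshold (unit-step) activation in question, written in our own notation (the value assigned at zero is a convention, not taken from the paper):

```latex
% Threshold activation: a hard 0/1 gate, in contrast to a ReLU.
\sigma(x) \;=\; \mathbb{1}\{x \ge 0\} \;=\;
\begin{cases}
  1, & x \ge 0, \\
  0, & x < 0.
\end{cases}
```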
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Robust Learning of Parsimonious Deep Neural Networks [0.0]
We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network.
We derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection.
We evaluate the proposed algorithm on the MNIST data set and commonly used fully connected and convolutional LeNet architectures.
arXiv Detail & Related papers (2022-05-10T03:38:55Z)
- Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks [75.33431791218302]
We study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape.
We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases.
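A rough sketch of one way such a parallel architecture can be laid out; the branch count, the summation at the output, and all names here are our own guesses for illustration, not the paper's definition:

```python
import torch
import torch.nn as nn

class ParallelReLUNet(nn.Module):
    """k parallel deep ReLU branches whose outputs are summed.

    With k = 1 this reduces to a standard deep ReLU network, which is
    one sense in which standard networks arise as a special case.
    """
    def __init__(self, dim: int, depth: int, k: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(*[m for _ in range(depth)
                            for m in (nn.Linear(dim, dim), nn.ReLU())])
            for _ in range(k)
        ])
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(sum(b(x) for b in self.branches))

y = ParallelReLUNet(dim=8, depth=2, k=4)(torch.randn(5, 8))  # shape (5, 1)
```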
arXiv Detail & Related papers (2021-10-18T18:00:36Z)
- A Weight Initialization Based on the Linear Product Structure for Neural Networks [0.0]
We study neural networks from a nonlinear point of view and propose a novel weight initialization strategy that is based on the linear product structure (LPS) of neural networks.
The proposed strategy is derived from the approximation of activation functions, using theories of numerical algebra to guarantee finding all the local minima.
arXiv Detail & Related papers (2021-09-01T00:18:59Z)
- On the Explicit Role of Initialization on the Convergence and Implicit Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
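A quick numerical illustration of the setting, using gradient descent with a small step size as a crude stand-in for gradient flow; the dimensions, initialization scale, and learning rate are our choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 20, 5, 20                     # samples, input dim, hidden width
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)          # a realizable linear target
W1 = rng.standard_normal((d, h)) / np.sqrt(d)   # single hidden layer,
W2 = rng.standard_normal((h, 1)) / np.sqrt(h)   # no nonlinearity

lr = 1e-2
for t in range(2001):
    r = X @ W1 @ W2 - y[:, None]        # network residual
    if t % 500 == 0:
        print(t, 0.5 * np.mean(r ** 2)) # squared loss shrinks roughly geometrically
    g = X.T @ r / n                     # gradient wrt the product W1 @ W2
    gW1, gW2 = g @ W2.T, W1.T @ g       # chain rule through the factorization
    W1 -= lr * gW1
    W2 -= lr * gW2
```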
arXiv Detail & Related papers (2021-05-13T15:13:51Z)
- Optimization Theory for ReLU Neural Networks Trained with Normalization Layers [82.61117235807606]
The success of deep neural networks is in part due to the use of normalization layers.
Our analysis shows how the introduction of normalization changes the landscape and can enable faster convergence.
arXiv Detail & Related papers (2020-06-11T23:55:54Z)
- MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient-based training combined with non-convexity renders learning susceptible to getting stuck in poor local optima.
We propose fusing neighboring layers of deeper networks that are trained with random initialization.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)