Global Convergence of Deep Networks with One Wide Layer Followed by
Pyramidal Topology
- URL: http://arxiv.org/abs/2002.07867v3
- Date: Thu, 17 Dec 2020 19:45:04 GMT
- Title: Global Convergence of Deep Networks with One Wide Layer Followed by
Pyramidal Topology
- Authors: Quynh Nguyen and Marco Mondelli
- Abstract summary: We prove that, for deep networks, a single layer of width $N$ following the input layer suffices to ensure a similar guarantee.
All the remaining layers are allowed to have constant widths, and form a pyramidal topology.
- Score: 28.49901662584467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have shown that gradient descent can find a global minimum for
over-parameterized neural networks where the widths of all the hidden layers
scale polynomially with $N$ ($N$ being the number of training samples). In this
paper, we prove that, for deep networks, a single layer of width $N$ following
the input layer suffices to ensure a similar guarantee. In particular, all the
remaining layers are allowed to have constant widths, and form a pyramidal
topology. We show an application of our result to the widely used LeCun's
initialization and obtain an over-parameterization requirement for the single
wide layer of order $N^2.$
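As a concrete illustration of the architecture the abstract describes (one wide hidden layer right after the input, followed by narrower layers of constant width), here is a minimal PyTorch sketch. The data, widths, activation, and training loop are illustrative assumptions, not the paper's exact construction; note the abstract's requirement under LeCun initialization is a wide layer of order $N^2$, while width $N$ is used here purely for readability.

```python
# Minimal sketch: one wide layer of width N after the input, then a pyramidal
# (non-increasing, constant-width) tail, trained by plain gradient descent on
# the squared loss with LeCun-style initialization. All concrete numbers below
# are assumptions for illustration.
import math
import torch
import torch.nn as nn

N, d_in, d_out = 64, 10, 1           # N training samples, input/output dims (assumed)
pyramid = [N, 32, 16, 8]             # wide layer of width N, then non-increasing widths

def lecun_linear(fan_in, fan_out):
    """Linear layer with LeCun initialization: weights ~ N(0, 1/fan_in)."""
    layer = nn.Linear(fan_in, fan_out)
    nn.init.normal_(layer.weight, std=1.0 / math.sqrt(fan_in))
    nn.init.zeros_(layer.bias)
    return layer

layers = []
widths = [d_in] + pyramid + [d_out]
for fan_in, fan_out in zip(widths[:-1], widths[1:]):
    layers += [lecun_linear(fan_in, fan_out), nn.Tanh()]
model = nn.Sequential(*layers[:-1])  # drop the activation after the output layer

# Plain gradient descent on the squared loss over synthetic data (assumed).
X, y = torch.randn(N, d_in), torch.randn(N, d_out)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(1000):
    opt.zero_grad()
    loss = ((model(X) - y) ** 2).mean()
    loss.backward()
    opt.step()
```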
Related papers
- Depth-Bounds for Neural Networks via the Braid Arrangement [5.127394801557798]
We prove a non-constant lower bound of $\Omega(\log\log d)$ on the number of hidden layers required to represent the maximum of $d$ numbers.
We show that a rank-3 maxout layer followed by a rank-2 maxout layer is sufficient to represent the maximum of 7 numbers.
arXiv Detail & Related papers (2025-02-13T13:37:52Z) - Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation.
Our results improve upon the shortcomings of the well-established Rademacher complexity-based bounds.
We show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution.
arXiv Detail & Related papers (2024-10-13T21:49:29Z) - Deep Neural Networks: Multi-Classification and Universal Approximation [0.0]
We demonstrate that a ReLU deep neural network with a width of $2$ and a depth of $2N+4M-1$ layers can achieve finite sample memorization for any dataset comprising $N$ elements.
We also provide depth estimates for approximating $W^{1,p}$ functions and width estimates for approximating $L^p(\Omega;\mathbb{R}^m)$ for $m\geq 1$.
arXiv Detail & Related papers (2024-09-10T14:31:21Z) - Implicit Hypersurface Approximation Capacity in Deep ReLU Networks [0.0]
We develop a geometric approximation theory for deep feed-forward neural networks with ReLU activations.
We show that a deep fully-connected ReLU network of width $d+1$ can implicitly construct an approximation as its zero contour.
arXiv Detail & Related papers (2024-07-04T11:34:42Z) - Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree-$k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error (a layerwise-training sketch appears after this list).
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z) - A Unified Algebraic Perspective on Lipschitz Neural Networks [88.14073994459586]
This paper introduces a novel perspective unifying various types of 1-Lipschitz neural networks.
We show that many existing techniques can be derived and generalized via finding analytical solutions of a common semidefinite programming (SDP) condition.
Our approach, called SDP-based Lipschitz Layers (SLL), allows us to design non-trivial yet efficient generalizations of convex potential layers.
arXiv Detail & Related papers (2023-03-06T14:31:09Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - On the Proof of Global Convergence of Gradient Descent for Deep ReLU
Networks with Linear Widths [9.42944841156154]
We show that gradient descent converges to a global optimum if the widths of all the hidden layers scale at least as $\Omega(N^8)$ ($N$ being the number of training samples).
arXiv Detail & Related papers (2021-01-24T00:29:19Z) - A simple geometric proof for the benefit of depth in ReLU networks [57.815699322370826]
We present a simple proof for the benefit of depth in multi-layer feedforward networks with rectified activation ("depth separation").
We present a concrete neural network with linear depth (in $m$) and small constant width ($\leq 4$) that classifies the problem with zero error.
arXiv Detail & Related papers (2021-01-18T15:40:27Z) - Revealing the Structure of Deep Neural Networks via Convex Duality [70.15611146583068]
We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of hidden layers.
We show that a set of optimal hidden layer weights for a norm regularized training problem can be explicitly found as the extreme points of a convex set.
We apply the same characterization to deep ReLU networks with whitened data and prove the same weight alignment holds.
arXiv Detail & Related papers (2020-02-22T21:13:44Z) - Backward Feature Correction: How Deep Learning Performs Deep
(Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)
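The "Learning Hierarchical Polynomials with Three-Layer Neural Networks" entry above mentions layerwise gradient descent. The sketch below shows one common reading of that training scheme on a three-layer network: fit one layer while the others are frozen, then switch. Widths, data, step counts, and the stage ordering are assumptions for illustration, not that paper's exact algorithm or target class.

```python
# Layerwise gradient descent on a three-layer ReLU network (illustrative sketch).
import torch
import torch.nn as nn

d, m1, m2 = 20, 128, 128
net = nn.Sequential(nn.Linear(d, m1), nn.ReLU(),
                    nn.Linear(m1, m2), nn.ReLU(),
                    nn.Linear(m2, 1))
X, y = torch.randn(512, d), torch.randn(512, 1)   # synthetic stand-in data (assumed)

def train_only(module_indices, steps=500, lr=1e-2):
    """Run SGD on the square loss, updating only the listed sub-modules of `net`."""
    for p in net.parameters():
        p.requires_grad_(False)
    params = []
    for i in module_indices:
        for p in net[i].parameters():
            p.requires_grad_(True)
            params.append(p)
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((net(X) - y) ** 2).mean().backward()
        opt.step()

train_only([0])     # stage 1: train the first hidden layer, rest frozen
train_only([2, 4])  # stage 2: train the upper layers on top of the learned features
```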