Learning Deep ReLU Networks Is Fixed-Parameter Tractable
- URL: http://arxiv.org/abs/2009.13512v1
- Date: Mon, 28 Sep 2020 17:58:43 GMT
- Title: Learning Deep ReLU Networks Is Fixed-Parameter Tractable
- Authors: Sitan Chen, Adam R. Klivans, Raghu Meka
- Abstract summary: We consider the problem of learning an unknown ReLU network with respect to Gaussian inputs.
We give an algorithm whose running time is a fixed polynomial in the ambient dimension and an (exponentially large) function of only the network's parameters.
Our bounds depend on the number of hidden units, depth, spectral norm of the weight matrices, and Lipschitz constant of the overall network.
- Score: 21.625005195943707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of learning an unknown ReLU network with respect to
Gaussian inputs and obtain the first nontrivial results for networks of depth
more than two. We give an algorithm whose running time is a fixed polynomial in
the ambient dimension and some (exponentially large) function of only the
network's parameters.
Our bounds depend on the number of hidden units, depth, spectral norm of the
weight matrices, and Lipschitz constant of the overall network (we show that
some dependence on the Lipschitz constant is necessary). We also give a bound
that is doubly exponential in the size of the network but is independent of
spectral norm. These results provably cannot be obtained using gradient-based
methods and give the first example of a class of efficiently learnable neural
networks that gradient descent will fail to learn.
In contrast, prior work for learning networks of depth three or higher
requires exponential time in the ambient dimension, even when the above
parameters are bounded by a constant. Additionally, all prior work for the
depth-two case requires well-conditioned weights and/or positive coefficients
to obtain efficient run-times. Our algorithm does not require these
assumptions.
Our main technical tool is a type of filtered PCA that can be used to
iteratively recover an approximate basis for the subspace spanned by the hidden
units in the first layer. Our analysis leverages new structural results on
lattice polynomials from tropical geometry.
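The subspace-recovery step can be illustrated with a toy moment-based sketch. This is not the paper's filtered PCA algorithm; it is a simplified illustration of the underlying idea that, for Gaussian inputs, a second-moment matrix weighted by the network's output has its column space inside the span of the first-layer hidden units. The dimensions `d`, `k`, `n`, the unit output weights, and the moment matrix `M` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 3, 200_000  # ambient dim, hidden units, samples (illustrative)

# Hidden ReLU directions (unit norm); the target is span{w_1, ..., w_k}.
W = rng.standard_normal((k, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def net(X):
    # One-hidden-layer ReLU network with unit output weights (a toy stand-in).
    return np.maximum(X @ W.T, 0.0).sum(axis=1)

X = rng.standard_normal((n, d))   # Gaussian inputs
y = net(X)

# Empirical moment matrix M ~= E[f(x) (x x^T - I)]; for Gaussian x its
# column space lies in the span of the first-layer weight vectors.
M = (X * y[:, None]).T @ X / n - y.mean() * np.eye(d)

# PCA step: the top-k eigenvectors approximate the hidden subspace.
eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
U = eigvecs[:, -k:]                   # d x k orthonormal basis estimate

# Check: each true w_i should lie (almost) inside span(U).
residual = np.linalg.norm(W.T - U @ (U.T @ W.T))
print(f"subspace residual: {residual:.3f}")
```

With enough samples the residual is small; the paper's actual algorithm additionally filters samples and iterates to handle deep networks, which this sketch does not attempt.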
Related papers
- Asymptotics of Learning with Deep Structured (Random) Features [9.366617422860543]
For a large class of feature maps we provide a tight characterisation of the test error associated with learning the readout layer.
In some cases our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.
arXiv Detail & Related papers (2024-02-21T18:35:27Z) - Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - A Unified Algebraic Perspective on Lipschitz Neural Networks [88.14073994459586]
This paper introduces a novel perspective unifying various types of 1-Lipschitz neural networks.
We show that many existing techniques can be derived and generalized via finding analytical solutions of a common semidefinite programming (SDP) condition.
Our approach, called SDP-based Lipschitz Layers (SLL), allows us to design non-trivial yet efficient generalization of convex potential layers.
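One well-known instance of the layers this line of work generalizes is the convex potential layer, a residual map that is 1-Lipschitz by construction. The sketch below is an illustrative numerical check of that property, not an implementation of the SLL construction itself; the dimensions and random weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
W = rng.standard_normal((d, d))
b = rng.standard_normal(d)
spec_sq = np.linalg.norm(W, 2) ** 2  # squared spectral norm of W

def convex_potential_layer(x):
    # x -> x - (2 / ||W||_2^2) W^T ReLU(W x + b): a provably 1-Lipschitz map.
    return x - (2.0 / spec_sq) * W.T @ np.maximum(W @ x + b, 0.0)

# Empirically check the 1-Lipschitz property on random input pairs.
ratios = []
for _ in range(100):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    num = np.linalg.norm(convex_potential_layer(x) - convex_potential_layer(y))
    ratios.append(num / np.linalg.norm(x - y))
print(f"max empirical Lipschitz ratio: {max(ratios):.3f}")
```

The observed ratios never exceed 1, consistent with the layer's certified Lipschitz bound.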
arXiv Detail & Related papers (2023-03-06T14:31:09Z) - Sharper analysis of sparsely activated wide neural networks with trainable biases [103.85569570164404]
This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime.
Surprisingly, it is shown that the network after sparsification can achieve as fast convergence as the original network.
Since the generalization bound has dependence on the smallest eigenvalue of the limiting NTK, this work further studies the least eigenvalue of the limiting NTK.
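The quantity at stake, the least eigenvalue of the limiting NTK, can be computed directly for a standard kernel. The sketch below uses the well-known closed form of the limiting first-layer NTK for a one-hidden-layer ReLU network on unit-norm inputs; the sample size, dimension, and inputs are illustrative assumptions, and the paper's sparsified setting is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs

# Limiting NTK of a one-hidden-layer ReLU network (first layer trained,
# no bias): K(x, x') = <x, x'> * (pi - arccos(<x, x'>)) / (2 * pi).
G = X @ X.T
K = G * (np.pi - np.arccos(np.clip(G, -1.0, 1.0))) / (2.0 * np.pi)

# The least eigenvalue governs convergence and generalization bounds.
lam_min = np.linalg.eigvalsh(K).min()
print(f"least NTK eigenvalue: {lam_min:.4f}")
```

For generic, pairwise non-parallel inputs this Gram matrix is positive definite, so the least eigenvalue is strictly positive.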
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - Globally Gated Deep Linear Networks [3.04585143845864]
We introduce Globally Gated Deep Linear Networks (GGDLNs) where gating units are shared among all processing units in each layer.
We derive exact equations for the generalization properties in these networks in the finite-width thermodynamic limit.
Our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width.
arXiv Detail & Related papers (2022-10-31T16:21:56Z) - Unified Field Theory for Deep and Recurrent Neural Networks [56.735884560668985]
We present a unified and systematic derivation of the mean-field theory for both recurrent and deep networks.
We find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks.
Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in $1/n$.
arXiv Detail & Related papers (2021-12-10T15:06:11Z) - Efficiently Learning Any One Hidden Layer ReLU Network From Queries [27.428198343906352]
We give the first polynomial-time algorithm for learning arbitrary one-hidden-layer neural networks with ReLU activations given black-box query access to the network.
Ours is the first algorithm with polynomial-time efficiency guarantees even for worst-case networks.
arXiv Detail & Related papers (2021-11-08T18:59:40Z) - Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks [75.33431791218302]
We study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape.
We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases.
arXiv Detail & Related papers (2021-10-18T18:00:36Z) - Deep Networks Provably Classify Data on Curves [12.309532551321334]
We study a model problem that uses a deep fully-connected neural network to classify data drawn from two disjoint smooth curves on the unit sphere.
We prove that when (i) the network depth is large relative to certain geometric properties that set the difficulty of the problem and (ii) the network width and number of samples are polynomial in the depth, randomly initialized gradient descent quickly learns to correctly classify all points on the two curves with high probability.
arXiv Detail & Related papers (2021-07-29T20:40:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.