Depth Separations in Neural Networks: Separating the Dimension from the Accuracy
- URL: http://arxiv.org/abs/2402.07248v2
- Date: Wed, 06 Nov 2024 19:05:34 GMT
- Title: Depth Separations in Neural Networks: Separating the Dimension from the Accuracy
- Authors: Itay Safran, Daniel Reichman, Paul Valiant
- Abstract summary: We prove an exponential size separation between depth 2 and depth 3 neural networks (with real inputs).
We show that the curse of dimensionality manifests itself in depth 2 approximation, even in cases where the target function can be represented efficiently using a depth 3 network.
- Score: 9.783697404304027
- Abstract: We prove an exponential size separation between depth 2 and depth 3 neural networks (with real inputs), when approximating an $\mathcal{O}(1)$-Lipschitz target function to constant accuracy, with respect to a distribution with support in the unit ball, under the mild assumption that the weights of the depth 2 network are exponentially bounded. This resolves an open problem posed in \citet{safran2019depth}, and proves that the curse of dimensionality manifests itself in depth 2 approximation, even in cases where the target function can be represented efficiently using a depth 3 network. Previously, lower bounds that were used to separate depth 2 from depth 3 networks required that at least one of the Lipschitz constant, target accuracy or (some measure of) the size of the domain of approximation scale \emph{polynomially} with the input dimension, whereas in our result these parameters are fixed to be \emph{constants} independent of the input dimension: our parameters are simultaneously optimal. Our lower bound holds for a wide variety of activation functions, and is based on a novel application of a worst- to average-case random self-reducibility argument, allowing us to leverage lower bounds for depth 2 threshold circuits in a new domain.
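As a schematic restatement of the separation described in the abstract (hedged: the precise accuracy measure, constants, and activation assumptions are those stated in the paper, not reproduced here), the claim can be read roughly as follows.

```latex
% Informal paraphrase of the depth-2 vs. depth-3 separation from the abstract.
% The exact accuracy measure, constants, and activation assumptions are those
% of the paper; this is only a schematic sketch.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\noindent\textbf{Informal statement.} There exist a universal constant
$\epsilon_0 > 0$, an $\mathcal{O}(1)$-Lipschitz target function
$f : \mathbb{B}^d \to \mathbb{R}$ that is efficiently representable by a
depth-$3$ network, and a distribution $\mu$ supported on the unit ball
$\mathbb{B}^d$, such that any depth-$2$ network with exponentially bounded
weights that approximates $f$ to accuracy $\epsilon_0$ with respect to $\mu$
must have size $2^{\Omega(d)}$.
\end{document}
```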
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space in which to model functions computed by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Depth Separation in Norm-Bounded Infinite-Width Neural Networks [55.21840159087921]
We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights.
We show that there are functions that are learnable with sample complexity polynomial in the input dimension by norm-controlled depth-3 ReLU networks, yet are not learnable with sub-exponential sample complexity by norm-controlled depth-2 ReLU networks.
arXiv Detail & Related papers (2024-02-13T21:26:38Z) - How Many Neurons Does it Take to Approximate the Maximum? [10.995895410470279]
We study the size of a neural network needed to approximate the maximum function over $d$ inputs.
We provide new lower and upper bounds on the width required for approximation across various depths (a generic ReLU construction for the exact maximum is sketched after this list).
arXiv Detail & Related papers (2023-07-18T12:47:35Z) - The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks [53.95175206863992]
We study the type of solutions to which gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss.
We prove that although shallow ReLU networks are universal approximators, stable shallow networks are not.
arXiv Detail & Related papers (2023-06-30T09:17:39Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Sharp asymptotics on the compression of two-layer neural networks [19.683271092724937]
We study the compression of a target two-layer neural network with N nodes into a compressed network with M < N nodes.
We conjecture that the optimum of the simplified optimization problem is achieved by taking weights on the Equiangular Tight Frame (ETF).
arXiv Detail & Related papers (2022-05-17T09:45:23Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - Sparse Uncertainty Representation in Deep Learning with Inducing Weights [22.912675044223302]
We extend Matheron's conditional Gaussian sampling rule to enable fast weight sampling, which allows our inference method to maintain a reasonable run-time compared with ensembles.
Our approach achieves competitive performance to the state-of-the-art in prediction and uncertainty estimation tasks with fully connected neural networks and ResNets.
arXiv Detail & Related papers (2021-05-30T18:17:47Z) - Global Convergence of Three-layer Neural Networks in the Mean Field Regime [3.553493344868413]
In the mean field regime, neural networks are appropriately scaled so that as the width tends to infinity, the learning dynamics tends to a nonlinear and nontrivial dynamical limit, known as the mean field limit.
Recent works have successfully applied such analysis to two-layer networks and provided global convergence guarantees.
We prove a global convergence result for unregularized feedforward three-layer networks in the mean field regime.
arXiv Detail & Related papers (2021-05-11T17:45:42Z) - Provable Memorization via Deep Neural Networks using Sub-linear Parameters [91.0268925267129]
It is known that $O(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs.
By exploiting depth, we show that $O(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points.
arXiv Detail & Related papers (2020-10-26T06:19:38Z) - Learning Deep ReLU Networks Is Fixed-Parameter Tractable [21.625005195943707]
We consider the problem of learning an unknown ReLU network with respect to Gaussian inputs.
We give an algorithm whose running time is a fixed polynomial in the ambient dimension.
Our bounds depend on the number of hidden units, depth, spectral norm of the weight matrices, and Lipschitz constant.
arXiv Detail & Related papers (2020-09-28T17:58:43Z)
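The entry "How Many Neurons Does it Take to Approximate the Maximum?" above concerns the network size needed to approximate the maximum of d inputs. As forward-referenced there, the following is a minimal NumPy sketch, not code from that paper, of the standard exact identity max(a, b) = b + ReLU(a - b), stacked in a binary tree so that the maximum of d inputs is computed by a ReLU network of depth O(log d).

```python
# Sketch only: illustrates the classical exact construction
# max(a, b) = b + ReLU(a - b), applied pairwise in a binary tree,
# so the maximum of d inputs is computed in O(log d) ReLU layers.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def pairwise_max_layer(x):
    """One tree layer: exact maxima of adjacent pairs via b + ReLU(a - b)."""
    if x.size % 2 == 1:               # duplicate the last element so pairs line up
        x = np.append(x, x[-1])
    a, b = x[0::2], x[1::2]
    return b + relu(a - b)            # equals np.maximum(a, b) exactly

def relu_network_max(x):
    """Compute max(x_1, ..., x_d) with O(log d) stacked pairwise-max layers."""
    x = np.asarray(x, dtype=float)
    while x.size > 1:
        x = pairwise_max_layer(x)
    return float(x[0])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = rng.normal(size=13)
    assert np.isclose(relu_network_max(v), v.max())
    print("network max:", relu_network_max(v), "true max:", v.max())
```

The script checks the construction against np.max on random inputs; the width/depth trade-offs studied in that paper concern how much such constructions can be flattened or must grow at shallower depths.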
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.