Width is Less Important than Depth in ReLU Neural Networks
- URL: http://arxiv.org/abs/2202.03841v1
- Date: Tue, 8 Feb 2022 13:07:22 GMT
- Title: Width is Less Important than Depth in ReLU Neural Networks
- Authors: Gal Vardi, Gilad Yehudai, Ohad Shamir
- Abstract summary: We show that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network.
We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$.
- Score: 40.83290846983707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We solve an open question from Lu et al. (2017), by showing that any target
network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$
network (independent of the target network's architecture), whose number of
parameters is essentially larger only by a linear factor. In light of previous
depth separation theorems, which imply that a similar result cannot hold when
the roles of width and depth are interchanged, it follows that depth plays a
more significant role than width in the expressive power of neural networks.
We extend our results to constructing networks with bounded weights, and to
constructing networks with width at most $d+2$, which is close to the minimal
possible width due to previous lower bounds. Both of these constructions cause
an extra polynomial factor in the number of parameters over the target network.
We also show an exact representation of wide and shallow networks using deep
and narrow networks which, in certain cases, does not increase the number of
parameters over the target network.
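The abstract's final claim, an exact representation of a wide and shallow network by a deep and narrow one, can be made concrete with a small sketch. The following is an illustrative NumPy construction, not the paper's (in particular it does not preserve the number of parameters, and it assumes inputs lie in $[0,1]^d$ so that the coordinates and an offset accumulator channel pass through the ReLU unchanged); the names W, b, v and the offset C are choices made for this example only.
```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 40                       # input dimension, hidden width of the target network
W = rng.standard_normal((k, d))    # target: f(x) = sum_i v[i] * relu(W[i] @ x + b[i])
b = rng.standard_normal(k)
v = rng.standard_normal(k)

relu = lambda z: np.maximum(z, 0.0)

def wide_shallow(x):
    """Width-k, depth-2 target network."""
    return v @ relu(W @ x + b)

# Offset C keeps the running partial sum nonnegative for every x in [0, 1]^d,
# so the accumulator channel also survives each ReLU unchanged.
C = float(np.sum(np.abs(v) * relu(np.maximum(W, 0.0).sum(axis=1) + b)))

def deep_narrow(x):
    """Depth-(k+1), width-(d+2) network computing the same function on [0, 1]^d.

    The state after each layer is (x_1, ..., x_d, s, u): the inputs, an
    offset running sum s, and the most recently computed hidden neuron u.
    """
    s, u = C, relu(W[0] @ x + b[0])
    for i in range(1, k):
        # One ReLU layer per hidden neuron: fold the previous neuron into the
        # running sum and compute the next neuron; every carried channel is
        # nonnegative, so the ReLU acts as the identity on it.
        x, s, u = relu(x), relu(s + v[i - 1] * u), relu(W[i] @ x + b[i])
    return s + v[k - 1] * u - C    # final affine layer removes the offset

xs = rng.uniform(0.0, 1.0, size=(200, d))
err = max(abs(wide_shallow(x) - deep_narrow(x)) for x in xs)
print(f"max |wide_shallow - deep_narrow| over 200 random inputs: {err:.2e}")
```
On the bounded domain the two networks agree up to floating-point error; the width of the narrow network is $d+2$, matching the regime discussed in the abstract, at the cost of one layer per hidden neuron of the target.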
Related papers
- Expressivity and Approximation Properties of Deep Neural Networks with
ReLU$^k$ Activation [2.3020018305241337]
We investigate the expressivity and approximation properties of deep networks employing the ReLU$^k$ activation function for $k \geq 2$.
Although deep ReLU$^k$ networks can approximate effectively, they also have the capability to represent higher-degree polynomials precisely.
arXiv Detail & Related papers (2023-12-27T09:11:14Z) - Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and
Scaling Limit [48.291961660957384]
We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers exhibit transfer of optimal hyperparameters across width and depth.
Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit.
arXiv Detail & Related papers (2023-09-28T17:20:50Z) - Feature-Learning Networks Are Consistent Across Widths At Realistic
Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their pointwise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z) - Width and Depth Limits Commute in Residual Networks [26.97391529844503]
We show that taking the width and depth to infinity in a deep neural network with skip connections results in the same covariance structure no matter how that limit is taken.
This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width.
We conduct extensive simulations that show an excellent match with our theoretical findings.
arXiv Detail & Related papers (2023-02-01T13:57:32Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Adversarial Examples in Multi-Layer Random ReLU Networks [39.797621513256026]
Adversarial examples arise in ReLU networks with independent Gaussian parameters.
Bottleneck layers in the network play a key role: the minimal width up to some point determines scales and sensitivities of mappings computed up to that point.
arXiv Detail & Related papers (2021-06-23T18:16:34Z) - Size and Depth Separation in Approximating Natural Functions with Neural
Networks [52.73592689730044]
We show the benefits of size and depth for approximation of natural functions with ReLU networks.
We show a complexity-theoretic barrier to proving such results beyond size $O(d)$.
We also show an explicit natural function that can be approximated with networks of size $O(d)$.
arXiv Detail & Related papers (2021-01-30T21:30:11Z) - Neural Networks with Small Weights and Depth-Separation Barriers [40.66211670342284]
For constant depths, existing results are limited to depths $2$ and $3$, and achieving results for higher depths has been an important question.
We focus on feedforward ReLU networks, and prove fundamental barriers to proving such results beyond depth $4$.
arXiv Detail & Related papers (2020-05-31T21:56:17Z) - Quasi-Equivalence of Width and Depth of Neural Networks [10.365556153676538]
We investigate whether the design of artificial neural networks should have a directional preference.
Inspired by De Morgan's laws, we establish a quasi-equivalence between the width and depth of ReLU networks.
Based on our findings, a deep network has a wide equivalent, subject to an arbitrarily small error.
arXiv Detail & Related papers (2020-02-06T21:17:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.