Width is Less Important than Depth in ReLU Neural Networks
- URL: http://arxiv.org/abs/2202.03841v1
- Date: Tue, 8 Feb 2022 13:07:22 GMT
- Title: Width is Less Important than Depth in ReLU Neural Networks
- Authors: Gal Vardi, Gilad Yehudai, Ohad Shamir
- Abstract summary: We show that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network.
We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$.
- Score: 40.83290846983707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We solve an open question from Lu et al. (2017), by showing that any target
network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$
network (independent of the target network's architecture), whose number of
parameters is essentially larger only by a linear factor. In light of previous
depth separation theorems, which imply that a similar result cannot hold when
the roles of width and depth are interchanged, it follows that depth plays a
more significant role than width in the expressive power of neural networks.
We extend our results to constructing networks with bounded weights, and to
constructing networks with width at most $d+2$, which is close to the minimal
possible width due to previous lower bounds. Both of these constructions cause
an extra polynomial factor in the number of parameters over the target network.
We also show an exact representation of wide and shallow networks using deep
and narrow networks which, in certain cases, does not increase the number of
parameters over the target network.
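The abstract's final claim, an exact representation of a wide and shallow network by a deep and narrow one, can be made concrete with a small sketch. The following is an illustrative NumPy construction, not the paper's (in particular it does not preserve the number of parameters, and it assumes inputs lie in $[0,1]^d$ so that the coordinates and an offset accumulator channel pass through the ReLU unchanged); the names W, b, v and the offset C are choices made for this example only.
```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 40                       # input dimension, hidden width of the target network
W = rng.standard_normal((k, d))    # target: f(x) = sum_i v[i] * relu(W[i] @ x + b[i])
b = rng.standard_normal(k)
v = rng.standard_normal(k)

relu = lambda z: np.maximum(z, 0.0)

def wide_shallow(x):
    """Width-k, depth-2 target network."""
    return v @ relu(W @ x + b)

# Offset C keeps the running partial sum nonnegative for every x in [0, 1]^d,
# so the accumulator channel also survives each ReLU unchanged.
C = float(np.sum(np.abs(v) * relu(np.maximum(W, 0.0).sum(axis=1) + b)))

def deep_narrow(x):
    """Depth-(k+1), width-(d+2) network computing the same function on [0, 1]^d.

    The state after each layer is (x_1, ..., x_d, s, u): the inputs, an
    offset running sum s, and the most recently computed hidden neuron u.
    """
    s, u = C, relu(W[0] @ x + b[0])
    for i in range(1, k):
        # One ReLU layer per hidden neuron: fold the previous neuron into the
        # running sum and compute the next neuron; every carried channel is
        # nonnegative, so the ReLU acts as the identity on it.
        x, s, u = relu(x), relu(s + v[i - 1] * u), relu(W[i] @ x + b[i])
    return s + v[k - 1] * u - C    # final affine layer removes the offset

xs = rng.uniform(0.0, 1.0, size=(200, d))
err = max(abs(wide_shallow(x) - deep_narrow(x)) for x in xs)
print(f"max |wide_shallow - deep_narrow| over 200 random inputs: {err:.2e}")
```
On the bounded domain the two networks agree up to floating-point error; the width of the narrow network is $d+2$, matching the regime discussed in the abstract, at the cost of one layer per hidden neuron of the target.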
Related papers
- Expressivity and Approximation Properties of Deep Neural Networks with
ReLU$^k$ Activation [2.3020018305241337]
We investigate the expressivity and approximation properties of deep networks employing the ReLU$^k$ activation function for $k \geq 2$.
Although deep ReLU$^k$ networks can approximate effectively, they also have the capability to represent higher-degree polynomials precisely.
arXiv Detail & Related papers (2023-12-27T09:11:14Z) - Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and
Scaling Limit [48.291961660957384]
We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers exhibit transfer of optimal hyperparameters across width and depth.
Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit.
arXiv Detail & Related papers (2023-09-28T17:20:50Z) - Feature-Learning Networks Are Consistent Across Widths At Realistic
Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data not only have identical loss curves but also agree in their pointwise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z) - Width and Depth Limits Commute in Residual Networks [26.97391529844503]
We show that taking the width and depth to infinity in a deep neural network with skip connections results in the same covariance structure no matter how that limit is taken.
This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width.
We conduct extensive simulations that show an excellent match with our theoretical findings.
arXiv Detail & Related papers (2023-02-01T13:57:32Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - Adversarial Examples in Multi-Layer Random ReLU Networks [39.797621513256026]
Adversarial examples arise in ReLU networks with independent Gaussian parameters.
Bottleneck layers in the network play a key role: the minimal width up to some point determines scales and sensitivities of mappings computed up to that point.
arXiv Detail & Related papers (2021-06-23T18:16:34Z) - Size and Depth Separation in Approximating Natural Functions with Neural
Networks [52.73592689730044]
We show the benefits of size and depth for approximation of natural functions with ReLU networks.
We show a complexity-theoretic barrier to proving such results beyond size $O(d)$.
We also show an explicit natural function that can be approximated with networks of size $O(d)$.
arXiv Detail & Related papers (2021-01-30T21:30:11Z) - Neural Networks with Small Weights and Depth-Separation Barriers [40.66211670342284]
For constant depths, existing results are limited to depths $2$ and $3$, and achieving results for higher depths has been an important question.
We focus on feedforward ReLU networks, and prove fundamental barriers to proving such results beyond depth $4$.
arXiv Detail & Related papers (2020-05-31T21:56:17Z) - Quasi-Equivalence of Width and Depth of Neural Networks [10.365556153676538]
We investigate whether the design of artificial neural networks should have a directional preference.
Inspired by De Morgan's laws, we establish a quasi-equivalence between the width and depth of ReLU networks.
Based on our findings, a deep network has a wide equivalent, subject to an arbitrarily small error.
arXiv Detail & Related papers (2020-02-06T21:17:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.