Better Depth-Width Trade-offs for Neural Networks through the lens of
Dynamical Systems
- URL: http://arxiv.org/abs/2003.00777v2
- Date: Mon, 20 Jul 2020 10:49:10 GMT
- Title: Better Depth-Width Trade-offs for Neural Networks through the lens of
Dynamical Systems
- Authors: Vaggos Chatziafratis and Sai Ganesh Nagarajan and Ioannis Panageas
- Abstract summary: Recently, depth separation results for ReLU networks were obtained via a new connection with dynamical systems.
We improve the existing width lower bounds along several aspects.
A byproduct of our results is that there exists a universal constant characterizing the depth-width trade-offs.
- Score: 24.229336600210015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The expressivity of neural networks as a function of their depth, width and
type of activation units has been an important question in deep learning
theory. Recently, depth separation results for ReLU networks were obtained via
a new connection with dynamical systems, using a generalized notion of fixed
points of a continuous map $f$, called periodic points. In this work, we
strengthen the connection with dynamical systems and we improve the existing
width lower bounds along several aspects. Our first main result is
period-specific width lower bounds that hold under the stronger notion of
$L^1$-approximation error, instead of the weaker classification error. Our
second contribution is that we provide sharper width lower bounds, still
yielding meaningful exponential depth-width separations, in regimes where
previous results wouldn't apply. A byproduct of our results is that there
exists a universal constant characterizing the depth-width trade-offs, as long
as $f$ has odd periods. Technically, our results follow by unveiling a tighter
connection between the following three quantities of a given function: its
period, its Lipschitz constant and the growth rate of the number of
oscillations arising under compositions of the function $f$ with itself.
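To make that last point concrete: the quantity driving the trade-off is how fast the number of monotone pieces (oscillations) of the iterates $f^t = f \circ \cdots \circ f$ grows, while the Lipschitz constant $L$ of $f$ caps that growth at $L^t$. The short sketch below is only an illustration of these three quantities, not code from the paper; the tent map (slope $2$, which has periodic points of every period, including odd ones) and the grid resolution are assumptions chosen for convenience.

```python
# Minimal numerical sketch (illustration only, not the paper's code): it tracks
# the three quantities the abstract relates -- the period structure of f, its
# Lipschitz constant, and how fast the number of oscillations of the t-fold
# composition f^t grows. The tent map below is an assumed stand-in for f.

import numpy as np

def f(x, slope=2.0):
    """Tent map on [0, 1]; it has a period-3 point, hence points of all periods."""
    return np.where(x < 0.5, slope * x, slope * (1.0 - x))

def compose(x, t):
    """Evaluate the t-fold composition f^t pointwise."""
    for _ in range(t):
        x = f(x)
    return x

def count_oscillations(y):
    """Count monotone pieces of a sampled function (sign changes of the slope)."""
    dy = np.diff(y)
    signs = np.sign(dy[dy != 0])
    return 1 + int(np.sum(signs[1:] != signs[:-1]))

grid = np.linspace(0.0, 1.0, 200_001)
for t in range(1, 8):
    y = compose(grid, t)
    # The Lipschitz constant of f^t is at most L^t (here L = 2); the number of
    # monotone pieces grows geometrically as well.
    print(f"t={t}: oscillations={count_oscillations(y)}, Lipschitz bound={2.0**t:.0f}")
```

For the tent map the two printed numbers coincide ($2^t$), i.e. this map saturates the Lipschitz bound on oscillation growth; the paper's lower bounds concern exactly this growth rate when $f$ has odd periods.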
Related papers
- Hamiltonian Mechanics of Feature Learning: Bottleneck Structure in Leaky ResNets [58.460298576330835]
We study Leaky ResNets, which interpolate between ResNets and Fully-Connected nets as an effective-depth parameter $\tilde{L}$ varies.
In the infinite-depth limit, we study 'representation geodesics' $A_p$: continuous paths in representation space (similar to NeuralODEs).
We leverage this intuition to explain the emergence of a bottleneck structure, as observed in previous work.
arXiv Detail & Related papers (2024-05-27T18:15:05Z) - Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization [5.678271181959529]
We study the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers.
We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour (see the toy sketch after this list).
We also empirically investigate how the depth degeneracy phenomenon can negatively impact training of real networks.
arXiv Detail & Related papers (2023-02-20T01:30:27Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Expressivity of Neural Networks via Chaotic Itineraries beyond
Sharkovsky's Theorem [8.492084752803528]
Given a target function $f$, how large must a neural network be in order to approximate $f$?
Recent works examine this basic question on neural network "expressivity" from the lens of dynamical systems.
arXiv Detail & Related papers (2021-10-19T22:28:27Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Learning Deep ReLU Networks Is Fixed-Parameter Tractable [21.625005195943707]
We consider the problem of learning an unknown ReLU network with respect to Gaussian inputs.
We give an algorithm whose running time is a fixed polynomial in the ambient dimension.
Our bounds depend on the number of hidden units, depth, spectral norm of the weight matrices, and Lipschitz constant.
arXiv Detail & Related papers (2020-09-28T17:58:43Z) - Doubly infinite residual neural networks: a diffusion process approach [8.642603456626393]
We show that deep ResNets do not suffer from undesirable forward-propagation properties.
We focus on doubly infinite fully-connected ResNets, for which we consider i.i.d. network parameters.
Our results highlight a limited expressive power of doubly infinite ResNets when the unscaled network's parameters are i.i.d. and the residual blocks are shallow.
arXiv Detail & Related papers (2020-07-07T07:45:34Z) - Depth Enables Long-Term Memory for Recurrent Neural Networks [0.0]
We introduce a measure of the network's ability to support information flow across time, referred to as the Start-End separation rank.
We prove that deep recurrent networks support Start-End separation ranks which are higher than those supported by their shallow counterparts.
arXiv Detail & Related papers (2020-03-23T10:29:14Z) - On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
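As a companion to the "Depth Degeneracy" entry above, which studies the angle between two inputs as a function of the number of layers and validates the theory with Monte Carlo experiments, here is a toy Monte Carlo sketch in the same spirit. It is not the cited paper's code; the width, depth, input dimension, and He-style Gaussian initialization are illustrative assumptions. It simply shows the angle between two fixed inputs collapsing toward zero as they pass through more and more random ReLU layers.

```python
# Toy Monte Carlo sketch (an illustration, not the cited paper's experiment):
# track the angle between two inputs as they propagate through a randomly
# initialized fully connected ReLU network at initialization.

import numpy as np

def angle(u, v):
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def random_relu_forward(x1, x2, depth=50, width=512, seed=None):
    """Push two inputs through the same random ReLU layers, recording their angle."""
    rng = np.random.default_rng(seed)
    angles = [angle(x1, x2)]
    h1, h2 = x1, x2
    for _ in range(depth):
        # He-style initialization keeps activation magnitudes at a stable scale.
        W = rng.normal(0.0, np.sqrt(2.0 / h1.shape[0]), size=(width, h1.shape[0]))
        h1 = np.maximum(W @ h1, 0.0)
        h2 = np.maximum(W @ h2, 0.0)
        angles.append(angle(h1, h2))
    return np.array(angles)

rng = np.random.default_rng(0)
x1 = rng.normal(size=64)
x2 = rng.normal(size=64)
# Average over several random networks: the angle decays toward 0 with depth.
runs = np.stack([random_relu_forward(x1, x2, seed=s) for s in range(20)])
for d in (0, 1, 5, 10, 25, 50):
    print(f"depth {d:2d}: mean angle {runs[:, d].mean():.4f} rad")
```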
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.