A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable
Optimization Via Overparameterization From Depth
- URL: http://arxiv.org/abs/2003.05508v2
- Date: Thu, 11 Jun 2020 19:10:15 GMT
- Title: A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable
Optimization Via Overparameterization From Depth
- Authors: Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, Lexing Ying
- Abstract summary: Training deep neural networks with stochastic gradient descent (SGD) can often achieve zero training loss on real-world tasks even though the optimization landscape is highly non-convex.
We propose a new continuum limit of infinitely deep residual networks, which enjoys a good landscape in the sense that every local minimizer is global.
- Score: 19.866928507243617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training deep neural networks with stochastic gradient descent (SGD) can
often achieve zero training loss on real-world tasks although the optimization
landscape is known to be highly non-convex. To understand the success of SGD
for training deep neural networks, this work presents a mean-field analysis of
deep residual networks, based on a line of works that interpret the continuum
limit of the deep residual network as an ordinary differential equation when
the network capacity tends to infinity. Specifically, we propose a new
continuum limit of deep residual networks, which enjoys a good landscape in the
sense that every local minimizer is global. This characterization enables us to
derive the first global convergence result for multilayer neural networks in
the mean-field regime. Furthermore, without assuming the convexity of the loss
landscape, our proof relies on a zero-loss assumption at the global minimizer
that can be achieved when the model shares a universal approximation property.
Key to our result is the observation that a deep residual network resembles a
shallow network ensemble, i.e. a two-layer network. We bound the difference
between the shallow network and our ResNet model via the adjoint sensitivity
method, which enables us to apply existing mean-field analyses of two-layer
networks to deep networks. Furthermore, we propose several novel training
schemes based on the new continuous model, including one training procedure
that switches the order of the residual blocks and results in strong empirical
performance on the benchmark datasets.
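To make the continuum-limit picture concrete, below is a minimal PyTorch sketch (not the authors' code) of a depth-L residual network whose blocks apply 1/L-scaled updates, so the forward pass is an explicit Euler discretization of an ODE dx/dt = f(x(t), theta(t)). The optional block shuffling illustrates one possible reading of the "switch the order of the residual blocks" procedure mentioned in the abstract; the block widths, shuffling schedule, and training loop are assumptions for illustration only.

```python
# Minimal sketch, assuming a 1/L-scaled residual update and random block
# reordering; this is an illustration of the continuum-limit idea, not the
# authors' implementation.
import random
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x)


class DeepResNet(nn.Module):
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.depth = depth
        self.blocks = nn.ModuleList([ResidualBlock(dim) for _ in range(depth)])

    def forward(self, x: torch.Tensor, shuffle_blocks: bool = False) -> torch.Tensor:
        order = list(range(self.depth))
        if shuffle_blocks:
            # Randomly permute the residual blocks (hypothetical reading of the
            # block-order-switching training procedure described in the abstract).
            random.shuffle(order)
        for i in order:
            # x_{l+1} = x_l + (1/L) f_l(x_l): an explicit Euler step of an ODE.
            x = x + self.blocks[i](x) / self.depth
        return x


if __name__ == "__main__":
    model = DeepResNet(dim=16, depth=100)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x, y = torch.randn(32, 16), torch.randn(32, 16)
    for step in range(10):
        loss = ((model(x, shuffle_blocks=True) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The 1/L scaling is what makes the depth limit well defined: as the number of blocks grows, each block contributes a smaller update, and the forward trajectory approaches the solution of the underlying ODE.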
Related papers
- Efficient Training of Deep Neural Operator Networks via Randomized Sampling [0.0]
Deep operator network (DeepONet) has demonstrated success in the real-time prediction of complex dynamics across various scientific and engineering applications.
We introduce a random sampling technique to be adopted in the training of DeepONet, aimed at improving the generalization ability of the model while significantly reducing computational time.
Our results indicate that incorporating randomization in the trunk network inputs during training enhances the efficiency and robustness of DeepONet, offering a promising avenue for improving the framework's performance in modeling complex physical systems.
arXiv Detail & Related papers (2024-09-20T07:18:31Z) - Generalization of Scaled Deep ResNets in the Mean-Field Regime [55.77054255101667]
We investigate scaled ResNets in the limit of infinitely deep and wide neural networks.
Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime.
arXiv Detail & Related papers (2024-03-14T21:48:00Z) - Fixing the NTK: From Neural Network Linearizations to Exact Convex
Programs [63.768739279562105]
We show that for a particular choice of mask weights that do not depend on the learning targets, this kernel is equivalent to the NTK of the gated ReLU network on the training data.
A consequence of this lack of dependence on the targets is that the NTK cannot perform better than the optimal MKL kernel on the training set.
arXiv Detail & Related papers (2023-09-26T17:42:52Z) - Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth
Soft-Thresholding [57.71603937699949]
We study optimization guarantees, i.e., achieving near-zero training loss with the increase in the number of learning epochs.
We show that the threshold on the number of training samples increases with the increase in the network width.
arXiv Detail & Related papers (2023-09-12T13:03:47Z) - Implicit regularization of deep residual networks towards neural ODEs [8.075122862553359]
We establish an implicit regularization of deep residual networks towards neural ODEs.
We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training.
arXiv Detail & Related papers (2023-09-03T16:35:59Z) - Singular Value Perturbation and Deep Network Optimization [29.204852309828006]
We develop new theoretical results on matrix perturbation to shed light on the impact of architecture on the performance of a deep network.
In particular, we explain what deep learning practitioners have long observed empirically: the parameters of some deep architectures are easier to optimize than others.
A direct application of our perturbation results explains analytically why a ResNet is easier to optimize than a ConvNet.
arXiv Detail & Related papers (2022-03-07T02:09:39Z) - Generalization bound of globally optimal non-convex neural network
training: Transportation map estimation by infinite dimensional Langevin
dynamics [50.83356836818667]
We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error.
Existing frameworks such as mean field theory and neural tangent kernel theory for neural network optimization analysis typically require taking the limit of infinite network width to show global convergence.
arXiv Detail & Related papers (2020-07-11T18:19:50Z) - Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z) - The Hidden Convex Optimization Landscape of Two-Layer ReLU Neural
Networks: an Exact Characterization of the Optimal Solutions [51.60996023961886]
We prove that finding all globally optimal two-layer ReLU neural networks can be performed by solving a convex optimization program with cone constraints.
Our analysis is novel, characterizes all optimal solutions, and does not leverage duality-based analysis which was recently used to lift neural network training into convex spaces.
arXiv Detail & Related papers (2020-06-10T15:38:30Z) - Variational Depth Search in ResNets [2.6763498831034043]
One-shot neural architecture search allows joint learning of weights and network architecture, reducing computational cost.
We limit our search space to the depth of residual networks and formulate an analytically tractable variational objective that allows for an unbiased approximate posterior over depths in one-shot.
We compare our proposed method against manual search over network depths on the MNIST, Fashion-MNIST, and SVHN datasets.
arXiv Detail & Related papers (2020-02-06T16:00:03Z)