Embedding Principle: a hierarchical structure of loss landscape of deep neural networks
- URL: http://arxiv.org/abs/2111.15527v1
- Date: Tue, 30 Nov 2021 16:15:50 GMT
- Title: Embedding Principle: a hierarchical structure of loss landscape of deep neural networks
- Authors: Yaoyu Zhang, Yuqing Li, Zhongwang Zhang, Tao Luo, Zhi-Qin John Xu
- Abstract summary: We prove a general Embedding Principle of the loss landscape of deep neural networks (NNs).
We provide a gross estimate of the dimension of critical submanifolds embedded from critical points of narrower NNs.
- Score: 3.0871079010101963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We prove a general Embedding Principle of the loss landscape of deep neural networks (NNs) that unravels a hierarchical structure of the loss landscape: the loss landscape of an NN contains all critical points of all narrower NNs. This result is obtained by constructing a class of critical embeddings which map any critical point of a narrower NN to a critical point of the target NN with the same output function. By discovering a wide class of general compatible critical embeddings, we provide a gross estimate of the dimension of the critical submanifolds embedded from critical points of narrower NNs. We further prove an irreversibility property of any critical embedding: the number of negative/zero/positive eigenvalues of the Hessian matrix at a critical point may increase but never decrease as the NN becomes wider through the embedding. Using a special realization of the general compatible critical embedding, we prove a stringent necessary condition for a "truly-bad" critical point, i.e., one that never becomes a strict-saddle point through any critical embedding. This result implies that strict-saddle points are commonplace in wide NNs, which may be an important reason underlying the easy optimization of wide NNs widely observed in practice.
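The simplest building block behind such critical embeddings is a one-step neuron-splitting embedding. Below is a minimal numerical sketch of it for a one-hidden-layer network (plain NumPy; the architecture, the split neuron j and the ratio alpha are illustrative assumptions, not the paper's general compatible construction): duplicating a hidden neuron and sharing its output weight between the two copies yields a wider network with exactly the same output function.

```python
import numpy as np

# Sketch (assumptions: plain NumPy, a one-hidden-layer tanh network) of a one-step
# "splitting" embedding: neuron j of the narrow net is duplicated, and its output
# weight is shared between the two copies with ratios alpha and 1 - alpha. The wider
# network realizes exactly the same output function, so the loss value is preserved.

rng = np.random.default_rng(0)

def forward(W, b, a, X):
    """One-hidden-layer network: f(x) = a . tanh(W x + b)."""
    return np.tanh(X @ W.T + b) @ a

# Narrow network: 3 hidden neurons, 2 inputs.
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
a = rng.normal(size=3)

# Embed into a 4-neuron network by splitting neuron j = 0 with ratio alpha.
j, alpha = 0, 0.3
W_wide = np.vstack([W, W[j]])              # duplicate the incoming weights of neuron j
b_wide = np.append(b, b[j])                # duplicate its bias
a_wide = np.append(a, (1 - alpha) * a[j])  # new copy gets (1 - alpha) of the output weight
a_wide[j] = alpha * a[j]                   # original copy keeps alpha of it

X = rng.normal(size=(5, 2))                # arbitrary inputs
narrow_out = forward(W, b, a, X)
wide_out = forward(W_wide, b_wide, a_wide, X)
print(np.max(np.abs(narrow_out - wide_out)))  # ~1e-16: identical output function
```

Because the output function is preserved, the loss is unchanged for any data; the paper proves that such splittings, and a much wider class of compatible critical embeddings, map critical points of the narrower network to critical points of the wider one.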
Related papers
- Neural Network Verification with Branch-and-Bound for General Nonlinearities [63.39918329535165]
Branch-and-bound (BaB) is among the most effective techniques for neural network (NN) verification.
We develop a general framework, named GenBaB, to conduct BaB on general nonlinearities to verify NNs with general architectures.
We demonstrate the effectiveness of our GenBaB on verifying a wide range of NNs, including NNs with activation functions such as Sigmoid, Tanh, Sine and GeLU.
arXiv Detail & Related papers (2024-05-31T17:51:07Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
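For context, the sketch below computes the empirical NTK of a one-hidden-layer ReLU network with bias terms from closed-form parameter gradients (plain NumPy; the width, scalings and parameter values are assumptions for illustration). The bias contribution appears as its own term in the kernel; the limiting bias-generalized NTK itself is derived in the cited paper and is not reproduced here.

```python
import numpy as np

# Sketch (assumptions: NumPy, f(x) = a . relu(W x + b)) of the empirical NTK
# K(x, x') = <grad_theta f(x), grad_theta f(x')>, using closed-form gradients.

rng = np.random.default_rng(2)
m, d = 512, 3                              # hidden width, input dimension (illustrative)
W = rng.normal(size=(m, d)) / np.sqrt(d)   # input weights
b = rng.normal(size=m)                     # biases
a = rng.normal(size=m) / np.sqrt(m)        # output weights

def empirical_ntk(x, x_prime):
    z, zp = W @ x + b, W @ x_prime + b
    s, sp = (z > 0).astype(float), (zp > 0).astype(float)
    # df/da_k = relu(z_k);  df/dw_k = a_k * 1[z_k > 0] * x;  df/db_k = a_k * 1[z_k > 0]
    k_a = np.dot(np.maximum(z, 0.0), np.maximum(zp, 0.0))
    k_wb = (x @ x_prime + 1.0) * np.sum(a**2 * s * sp)   # the "+ 1" is the bias term
    return k_a + k_wb

x, x_prime = rng.normal(size=d), rng.normal(size=d)
print(empirical_ntk(x, x_prime))
```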
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks [3.5208869573271446]
We prove an embedding principle in depth: the loss landscape of an NN "contains" all critical points of the loss landscapes of shallower NNs.
We empirically demonstrate that, through suppressing layer linearization, batch normalization helps avoid the lifted critical manifold.
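One simple way to realize such a lifting, sketched below under illustrative assumptions (plain NumPy, ReLU activations; not necessarily the paper's general lifting operator), is to insert an extra hidden layer initialized at the identity with zero bias right after a ReLU layer: since post-ReLU activations are nonnegative, the deeper network computes exactly the same output function.

```python
import numpy as np

# Sketch (assumptions: NumPy, ReLU activations) of a depth embedding: an extra hidden
# layer with identity weights and zero bias is inserted after a ReLU layer. Because
# post-ReLU activations are nonnegative, relu(I h + 0) = h, so the deeper network
# realizes the same output function as the shallower one.

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(z, 0.0)

def shallow(x, W1, b1, w2):
    return relu(x @ W1.T + b1) @ w2

def deep(x, W1, b1, W_mid, b_mid, w2):
    h = relu(x @ W1.T + b1)
    return relu(h @ W_mid.T + b_mid) @ w2   # extra layer inserted here

W1 = rng.normal(size=(4, 3)); b1 = rng.normal(size=4); w2 = rng.normal(size=4)
W_mid, b_mid = np.eye(4), np.zeros(4)       # identity layer: the lifted parameters

X = rng.normal(size=(6, 3))
print(np.max(np.abs(shallow(X, W1, b1, w2) - deep(X, W1, b1, W_mid, b_mid, w2))))  # 0.0
```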
arXiv Detail & Related papers (2022-05-26T11:42:44Z) - On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Deep Stable neural networks: large-width asymptotics and convergence rates [3.0108936184913295]
We show that as the width goes to infinity jointly over the NN's layers, a suitable rescaled deep Stable NN converges weakly to a Stable SP.
Because of the non-triangular NN's structure, this is a non-standard problem, to which we propose a novel and self-contained inductive approach.
arXiv Detail & Related papers (2021-08-02T12:18:00Z) - The loss landscape of deep linear neural networks: a second-order analysis [9.85879905918703]
We study the optimization landscape of deep linear neural networks with the square loss.
We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points.
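As a concrete, minimal instance of such a second-order analysis (plain NumPy; depth 2, arbitrary random data, not the paper's general characterization): the origin W1 = W2 = 0 is always a critical point of the square loss, and a finite-difference Hessian reveals a strictly negative eigenvalue whenever Y X^T is nonzero, i.e., the origin is a strict saddle.

```python
import numpy as np

# Sketch (assumptions: NumPy, depth-2 linear net, square loss, random data) of a
# second-order check at the origin: the gradient vanishes there, and the Hessian,
# estimated by central differences, has a strictly negative eigenvalue.

rng = np.random.default_rng(3)
d_in, d_hid, d_out, n = 2, 2, 2, 8
X, Y = rng.normal(size=(d_in, n)), rng.normal(size=(d_out, n))

def loss(theta):
    W1 = theta[:d_hid * d_in].reshape(d_hid, d_in)
    W2 = theta[d_hid * d_in:].reshape(d_out, d_hid)
    R = W2 @ W1 @ X - Y
    return 0.5 * np.sum(R**2)

p = d_hid * d_in + d_out * d_hid
theta0, eps = np.zeros(p), 1e-4
H = np.zeros((p, p))
for i in range(p):                      # central-difference Hessian at the origin
    for j in range(p):
        e_i, e_j = np.eye(p)[i] * eps, np.eye(p)[j] * eps
        H[i, j] = (loss(theta0 + e_i + e_j) - loss(theta0 + e_i - e_j)
                   - loss(theta0 - e_i + e_j) + loss(theta0 - e_i - e_j)) / (4 * eps**2)
print(np.min(np.linalg.eigvalsh(H)))    # strictly negative: the origin is a strict saddle
```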
arXiv Detail & Related papers (2021-07-28T11:33:18Z) - Embedding Principle of Loss Landscape of Deep Neural Networks [1.1958610985612828]
We show that the loss landscape of a deep neural network (DNN) "contains" all the critical points of all the narrower DNNs.
We find that a wide DNN is often attracted by highly degenerate critical points that are embedded from narrow DNNs.
arXiv Detail & Related papers (2021-05-30T15:32:32Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.