Embedding Principle: a hierarchical structure of loss landscape of deep neural networks
- URL: http://arxiv.org/abs/2111.15527v1
- Date: Tue, 30 Nov 2021 16:15:50 GMT
- Title: Embedding Principle: a hierarchical structure of loss landscape of deep neural networks
- Authors: Yaoyu Zhang, Yuqing Li, Zhongwang Zhang, Tao Luo, Zhi-Qin John Xu
- Abstract summary: We prove a general Embedding Principle of the loss landscape of deep neural networks (NNs).
We provide a gross estimate of the dimension of critical submanifolds embedded from critical points of narrower NNs.
- Score: 3.0871079010101963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We prove a general Embedding Principle of the loss landscape of deep neural networks (NNs) that unravels a hierarchical structure of the loss landscape: the loss landscape of an NN contains all critical points of all narrower NNs. This result is obtained by constructing a class of critical embeddings which map any critical point of a narrower NN to a critical point of the target NN with the same output function. By discovering a wide class of general compatible critical embeddings, we provide a gross estimate of the dimension of the critical submanifolds embedded from critical points of narrower NNs. We further prove an irreversibility property of any critical embedding: the number of negative/zero/positive eigenvalues of the Hessian matrix at a critical point may increase but never decrease as the NN becomes wider through the embedding. Using a special realization of the general compatible critical embedding, we prove a stringent necessary condition for a "truly-bad" critical point, i.e., one that never becomes a strict-saddle point through any critical embedding. This result implies that strict-saddle points are commonplace in wide NNs, which may be an important reason underlying the easy optimization of wide NNs widely observed in practice.
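The simplest building block behind such critical embeddings is a one-step neuron-splitting embedding. Below is a minimal numerical sketch of it for a one-hidden-layer network (plain NumPy; the architecture, the split neuron j and the ratio alpha are illustrative assumptions, not the paper's general compatible construction): duplicating a hidden neuron and sharing its output weight between the two copies yields a wider network with exactly the same output function.

```python
import numpy as np

# Sketch (assumptions: plain NumPy, a one-hidden-layer tanh network) of a one-step
# "splitting" embedding: neuron j of the narrow net is duplicated, and its output
# weight is shared between the two copies with ratios alpha and 1 - alpha. The wider
# network realizes exactly the same output function, so the loss value is preserved.

rng = np.random.default_rng(0)

def forward(W, b, a, X):
    """One-hidden-layer network: f(x) = a . tanh(W x + b)."""
    return np.tanh(X @ W.T + b) @ a

# Narrow network: 3 hidden neurons, 2 inputs.
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
a = rng.normal(size=3)

# Embed into a 4-neuron network by splitting neuron j = 0 with ratio alpha.
j, alpha = 0, 0.3
W_wide = np.vstack([W, W[j]])              # duplicate the incoming weights of neuron j
b_wide = np.append(b, b[j])                # duplicate its bias
a_wide = np.append(a, (1 - alpha) * a[j])  # new copy gets (1 - alpha) of the output weight
a_wide[j] = alpha * a[j]                   # original copy keeps alpha of it

X = rng.normal(size=(5, 2))                # arbitrary inputs
narrow_out = forward(W, b, a, X)
wide_out = forward(W_wide, b_wide, a_wide, X)
print(np.max(np.abs(narrow_out - wide_out)))  # ~1e-16: identical output function
```

Because the output function is preserved, the loss is unchanged for any data; the paper proves that such splittings, and a much wider class of compatible critical embeddings, map critical points of the narrower network to critical points of the wider one.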
Related papers
- Neural Network Verification with Branch-and-Bound for General Nonlinearities [63.39918329535165]
Branch-and-bound (BaB) is among the most effective techniques for neural network (NN) verification.
We develop a general framework, named GenBaB, to conduct BaB on general nonlinearities to verify NNs with general architectures.
We demonstrate the effectiveness of our GenBaB on verifying a wide range of NNs, including NNs with activation functions such as Sigmoid, Tanh, Sine and GeLU.
arXiv Detail & Related papers (2024-05-31T17:51:07Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
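For context, the sketch below computes the empirical NTK of a one-hidden-layer ReLU network with bias terms from closed-form parameter gradients (plain NumPy; the width, scalings and parameter values are assumptions for illustration). The bias contribution appears as its own term in the kernel; the limiting bias-generalized NTK itself is derived in the cited paper and is not reproduced here.

```python
import numpy as np

# Sketch (assumptions: NumPy, f(x) = a . relu(W x + b)) of the empirical NTK
# K(x, x') = <grad_theta f(x), grad_theta f(x')>, using closed-form gradients.

rng = np.random.default_rng(2)
m, d = 512, 3                              # hidden width, input dimension (illustrative)
W = rng.normal(size=(m, d)) / np.sqrt(d)   # input weights
b = rng.normal(size=m)                     # biases
a = rng.normal(size=m) / np.sqrt(m)        # output weights

def empirical_ntk(x, x_prime):
    z, zp = W @ x + b, W @ x_prime + b
    s, sp = (z > 0).astype(float), (zp > 0).astype(float)
    # df/da_k = relu(z_k);  df/dw_k = a_k * 1[z_k > 0] * x;  df/db_k = a_k * 1[z_k > 0]
    k_a = np.dot(np.maximum(z, 0.0), np.maximum(zp, 0.0))
    k_wb = (x @ x_prime + 1.0) * np.sum(a**2 * s * sp)   # the "+ 1" is the bias term
    return k_a + k_wb

x, x_prime = rng.normal(size=d), rng.normal(size=d)
print(empirical_ntk(x, x_prime))
```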
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks [3.5208869573271446]
We prove an embedding principle in depth: the loss landscape of an NN "contains" all critical points of the loss landscapes of shallower NNs.
We empirically demonstrate that, through suppressing layer linearization, batch normalization helps avoid the lifted critical manifold.
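One simple way to realize such a lifting, sketched below under illustrative assumptions (plain NumPy, ReLU activations; not necessarily the paper's general lifting operator), is to insert an extra hidden layer initialized at the identity with zero bias right after a ReLU layer: since post-ReLU activations are nonnegative, the deeper network computes exactly the same output function.

```python
import numpy as np

# Sketch (assumptions: NumPy, ReLU activations) of a depth embedding: an extra hidden
# layer with identity weights and zero bias is inserted after a ReLU layer. Because
# post-ReLU activations are nonnegative, relu(I h + 0) = h, so the deeper network
# realizes the same output function as the shallower one.

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(z, 0.0)

def shallow(x, W1, b1, w2):
    return relu(x @ W1.T + b1) @ w2

def deep(x, W1, b1, W_mid, b_mid, w2):
    h = relu(x @ W1.T + b1)
    return relu(h @ W_mid.T + b_mid) @ w2   # extra layer inserted here

W1 = rng.normal(size=(4, 3)); b1 = rng.normal(size=4); w2 = rng.normal(size=4)
W_mid, b_mid = np.eye(4), np.zeros(4)       # identity layer: the lifted parameters

X = rng.normal(size=(6, 3))
print(np.max(np.abs(shallow(X, W1, b1, w2) - deep(X, W1, b1, W_mid, b_mid, w2))))  # 0.0
```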
arXiv Detail & Related papers (2022-05-26T11:42:44Z) - On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Deep Stable neural networks: large-width asymptotics and convergence rates [3.0108936184913295]
We show that as the width goes to infinity jointly over the NN's layers, a suitable rescaled deep Stable NN converges weakly to a Stable SP.
Because of the non-triangular NN's structure, this is a non-standard problem, to which we propose a novel and self-contained inductive approach.
arXiv Detail & Related papers (2021-08-02T12:18:00Z) - The loss landscape of deep linear neural networks: a second-order analysis [9.85879905918703]
We study the optimization landscape of deep linear neural networks with the square loss.
We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points.
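As a concrete, minimal instance of such a second-order analysis (plain NumPy; depth 2, arbitrary random data, not the paper's general characterization): the origin W1 = W2 = 0 is always a critical point of the square loss, and a finite-difference Hessian reveals a strictly negative eigenvalue whenever Y X^T is nonzero, i.e., the origin is a strict saddle.

```python
import numpy as np

# Sketch (assumptions: NumPy, depth-2 linear net, square loss, random data) of a
# second-order check at the origin: the gradient vanishes there, and the Hessian,
# estimated by central differences, has a strictly negative eigenvalue.

rng = np.random.default_rng(3)
d_in, d_hid, d_out, n = 2, 2, 2, 8
X, Y = rng.normal(size=(d_in, n)), rng.normal(size=(d_out, n))

def loss(theta):
    W1 = theta[:d_hid * d_in].reshape(d_hid, d_in)
    W2 = theta[d_hid * d_in:].reshape(d_out, d_hid)
    R = W2 @ W1 @ X - Y
    return 0.5 * np.sum(R**2)

p = d_hid * d_in + d_out * d_hid
theta0, eps = np.zeros(p), 1e-4
H = np.zeros((p, p))
for i in range(p):                      # central-difference Hessian at the origin
    for j in range(p):
        e_i, e_j = np.eye(p)[i] * eps, np.eye(p)[j] * eps
        H[i, j] = (loss(theta0 + e_i + e_j) - loss(theta0 + e_i - e_j)
                   - loss(theta0 - e_i + e_j) + loss(theta0 - e_i - e_j)) / (4 * eps**2)
print(np.min(np.linalg.eigvalsh(H)))    # strictly negative: the origin is a strict saddle
```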
arXiv Detail & Related papers (2021-07-28T11:33:18Z) - Embedding Principle of Loss Landscape of Deep Neural Networks [1.1958610985612828]
We show that the loss landscape of a deep neural network (DNN) "contains" all the critical points of all the narrower DNNs.
We find that a wide DNN is often attracted by highly degenerate critical points that are embedded from narrow DNNs.
arXiv Detail & Related papers (2021-05-30T15:32:32Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - On Random Kernels of Residual Architectures [93.94469470368988]
We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets.
Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity.
In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed.
arXiv Detail & Related papers (2020-01-28T16:47:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.