Embedding Principle in Depth for the Loss Landscape Analysis of Deep
Neural Networks
- URL: http://arxiv.org/abs/2205.13283v1
- Date: Thu, 26 May 2022 11:42:44 GMT
- Title: Embedding Principle in Depth for the Loss Landscape Analysis of Deep
Neural Networks
- Authors: Zhiwei Bai, Tao Luo, Zhi-Qin John Xu, Yaoyu Zhang
- Abstract summary: We prove an embedding principle in depth: the loss landscape of an NN "contains" all critical points of the loss landscapes of shallower NNs.
We empirically demonstrate that, by suppressing layer linearization, batch normalization helps avoid the lifted critical manifolds.
- Score: 3.5208869573271446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unraveling the general structure underlying the loss landscapes of deep
neural networks (DNNs) is important for the theoretical study of deep learning.
Inspired by the embedding principle of the DNN loss landscape, we prove in this
work an embedding principle in depth: the loss landscape of an NN "contains"
all critical points of the loss landscapes of shallower NNs. Specifically, we
propose a critical lifting operator by which any critical point of a shallower
network can be lifted to a critical manifold of the target deeper network while
preserving the outputs. Through lifting, a local minimum of an NN can become a
strict saddle point of a deeper NN, which can be easily escaped by first-order
methods. The embedding principle in depth reveals a large family of critical
points in which layer linearization happens, i.e., the computation of certain
layers is effectively linear for the training inputs. We empirically
demonstrate that, by suppressing layer linearization, batch normalization
helps avoid the lifted critical manifolds, resulting in a faster decay of the loss.
We also demonstrate that increasing the amount of training data shrinks the lifted critical
manifolds and thus could accelerate training. Overall, the embedding principle
in depth well complements the embedding principle (in width), resulting in a
complete characterization of the hierarchical structure of critical
points/manifolds of a DNN loss landscape.
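As a minimal numerical sketch of the output-preserving idea behind the lifting (a toy illustration with an identity layer, not the paper's general operator): inserting an identity weight matrix after a ReLU layer leaves the outputs unchanged, because ReLU activations are nonnegative and the new ReLU layer therefore acts linearly on them, a simple case of layer linearization.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Shallow two-layer ReLU network: x -> relu(W1 x) -> W2 h.
d_in, d_hidden, d_out, n = 5, 8, 3, 20
X = rng.normal(size=(d_in, n))
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))
shallow_out = W2 @ relu(W1 @ X)

# "Lift" to a deeper network by inserting an identity layer between W1 and W2.
# relu(W1 @ X) is nonnegative, so the inserted ReLU acts linearly on it
# (layer linearization) and the outputs are preserved.
W_new = np.eye(d_hidden)
deep_out = W2 @ relu(W_new @ relu(W1 @ X))

print("outputs preserved after lifting:", np.allclose(shallow_out, deep_out))
```

Replacing the identity with any positive diagonal matrix D, while replacing W2 with W2 times the inverse of D, also preserves the outputs; this is one way lifted points come in whole manifolds rather than in isolation.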
Related papers
- Component-based Sketching for Deep ReLU Nets [55.404661149594375]
We develop a sketching scheme based on deep net components for various tasks.
We transform deep net training into a linear empirical risk minimization problem.
We show that the proposed component-based sketching provides almost optimal rates in approximating saturated functions.
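To make the "linear empirical risk minimization" reduction concrete, here is a generic random-feature sketch (toy data and my own component choice, not the authors' sketching scheme): the nonlinear components are fixed, so training reduces to a linear least-squares problem for the readout.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Toy regression data (hypothetical).
n, d = 200, 10
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Fixed, untrained nonlinear components: two random ReLU layers.
W1 = rng.normal(size=(d, 64)) / np.sqrt(d)
W2 = rng.normal(size=(64, 64)) / np.sqrt(64)
features = relu(relu(X @ W1) @ W2)

# With the components frozen, training reduces to linear empirical risk
# minimization over the readout weights.
readout, *_ = np.linalg.lstsq(features, y, rcond=None)
print("train MSE of the linear readout:", np.mean((features @ readout - y) ** 2))
```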
arXiv Detail & Related papers (2024-09-21T15:30:43Z)
- Peeking Behind the Curtains of Residual Learning [10.915277646160707]
"The Plain Neural Net Hypothesis" (PNNH) identifies the internal path across non-linear layers as the most critical part in residual learning.
We thoroughly evaluate PNNH-enabled CNN architectures and Transformers on popular vision benchmarks, showing on-par accuracy, up to 0.3% higher training throughput, and 2x better parameter efficiency compared to ResNets and vision Transformers.
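For concreteness, a minimal residual block in numpy showing the two paths the hypothesis distinguishes, the identity skip connection and the internal non-linear path (generic residual-learning code, not the PNNH architecture itself):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def residual_block(x, W_a, W_b):
    internal = W_b @ relu(W_a @ x)  # internal non-linear path across layers
    return x + internal             # identity (skip) path added back

d = 16
x = rng.normal(size=d)
W_a = rng.normal(size=(d, d)) * 0.1
W_b = rng.normal(size=(d, d)) * 0.1
print("block output shape:", residual_block(x, W_a, W_b).shape)
```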
arXiv Detail & Related papers (2024-02-13T18:24:10Z)
- Addressing caveats of neural persistence with deep graph persistence [54.424983583720675]
We find that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence.
We propose an extension of the filtration underlying neural persistence to the whole neural network instead of single layers.
This yields our deep graph persistence measure, which implicitly incorporates persistent paths through the network and alleviates variance-related issues.
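Below is a rough single-layer sketch of the kind of zero-dimensional persistence computation this line of work builds on: edges of a layer's bipartite weight graph enter by decreasing normalized magnitude and a union-find records component merges. This is a simplified illustration of neural-persistence-style filtrations, not the authors' whole-network deep graph persistence measure.

```python
import numpy as np

def zero_dim_persistence(W):
    """Simplified 0-dim persistence of one layer's bipartite weight graph."""
    n_out, n_in = W.shape
    w = np.abs(W) / np.abs(W).max()
    # Edges (weight, vertex_a, vertex_b), sorted by decreasing normalized weight.
    edges = sorted(
        ((w[i, j], i, n_out + j) for i in range(n_out) for j in range(n_in)),
        reverse=True,
    )
    parent = list(range(n_out + n_in))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    persistence = []
    for weight, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:                      # this edge merges two components
            parent[ra] = rb
            persistence.append(1.0 - weight)
    return np.array(persistence)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 8))
p = zero_dim_persistence(W)
print("finite persistence pairs:", len(p), "| 2-norm:", np.linalg.norm(p))
```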
arXiv Detail & Related papers (2023-07-20T13:34:11Z)
- Rank Diminishing in Deep Neural Networks [71.03777954670323]
Rank of neural networks measures information flowing across layers.
It is an instance of a key structural condition that applies across broad domains of machine learning.
For neural networks, however, the intrinsic mechanism that yields low-rank structures remains vague and unclear.
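One concrete sense in which rank cannot grow across layers can be checked numerically (a toy experiment of my own, not the paper's analysis): the Jacobian of the first k layers of a ReLU network is a product of per-layer Jacobians, so its rank is non-increasing in k.

```python
import numpy as np

rng = np.random.default_rng(0)

# Jacobian of a ReLU network at one input: J_k = D_k W_k ... D_1 W_1.
# Since rank(A @ B) <= rank(B), the rank of the partial Jacobian can only
# stay the same or drop as more layers are composed.
widths = [32, 32, 16, 32, 32]   # hypothetical layer widths
x = rng.normal(size=32)
h, J = x, np.eye(32)
for ell, w_out in enumerate(widths, start=1):
    W = rng.normal(size=(w_out, h.shape[0])) / np.sqrt(h.shape[0])
    pre = W @ h
    D = np.diag((pre > 0).astype(float))  # ReLU activation pattern at this input
    h = np.maximum(pre, 0.0)
    J = D @ W @ J                         # Jacobian of the first `ell` layers
    print(f"after layer {ell}: Jacobian rank = {np.linalg.matrix_rank(J)}")
```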
arXiv Detail & Related papers (2022-06-13T12:03:32Z)
- Exact Solutions of a Deep Linear Network [2.2344764434954256]
This work finds the analytical expression of the global minima of a deep linear network with weight decay and stochastic neurons.
We show that weight decay strongly interacts with the model architecture and can create bad minima at zero in a network with more than one hidden layer.
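A small numerical check of the flavor of this claim (toy data, my own setup): with two hidden layers the data-fitting term is cubic in the weights near the origin while weight decay is quadratic, so the origin is a critical point that small perturbations cannot escape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deep linear network with two hidden layers, f(x) = W3 W2 W1 x, plus weight decay.
n, d, lam = 50, 4, 1e-2
X = rng.normal(size=(d, n))
Y = rng.normal(size=(1, n))

def loss(W1, W2, W3):
    residual = W3 @ W2 @ W1 @ X - Y
    reg = lam * (np.sum(W1**2) + np.sum(W2**2) + np.sum(W3**2))
    return np.mean(residual**2) + reg

zero = [np.zeros((4, d)), np.zeros((4, 4)), np.zeros((1, 4))]
base = loss(*zero)
# Near the origin the product W3 W2 W1 is third order in the perturbation,
# while weight decay is second order, so every small perturbation raises the loss.
worst = min(
    loss(*[w + 1e-3 * rng.normal(size=w.shape) for w in zero]) - base
    for _ in range(1000)
)
print("smallest loss change over random small perturbations:", worst)  # > 0
```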
arXiv Detail & Related papers (2022-02-10T00:13:34Z)
- Embedding Principle: a hierarchical structure of loss landscape of deep neural networks [3.0871079010101963]
We prove a general Embedding Principle of loss landscape of deep neural networks (NNs)
We provide a gross estimate of the dimension of critical submanifolds embedded from critical points of narrower NNs.
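For reference, the standard one-step splitting embedding (in width) that this line of work builds on can be written in a few lines of numpy; this is a paraphrase of the well-known construction, not necessarily the paper's most general embedding. A hidden neuron is duplicated and its output weight split, so the wider network reproduces the narrower network's outputs exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Narrow one-hidden-layer network.
d_in, m, n = 5, 4, 30
X = rng.normal(size=(d_in, n))
W1 = rng.normal(size=(m, d_in))
a = rng.normal(size=(1, m))
narrow_out = a @ relu(W1 @ X)

# One-step splitting embedding into width m + 1: copy neuron 0's input weights
# and split its output weight into alpha and (1 - alpha); outputs are preserved.
alpha = 0.3
W1_wide = np.vstack([W1, W1[0:1, :]])
a_wide = np.hstack([a, np.zeros((1, 1))])
a_wide[0, 0] = alpha * a[0, 0]
a_wide[0, m] = (1 - alpha) * a[0, 0]
wide_out = a_wide @ relu(W1_wide @ X)

print("outputs preserved by the splitting embedding:", np.allclose(narrow_out, wide_out))
```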
arXiv Detail & Related papers (2021-11-30T16:15:50Z)
- Embedding Principle of Loss Landscape of Deep Neural Networks [1.1958610985612828]
We show that the loss landscape of a deep neural network (DNN) "contains" all the critical points of all narrower DNNs.
We find that a wide DNN is often attracted by highly degenerate critical points that are embedded from narrow DNNs.
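To make the degeneracy concrete (a toy check of my own): along the splitting parameter alpha of the construction sketched above, the wide network's function, and hence its loss, does not change at all, so embedded critical points come with flat directions and a degenerate Hessian.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

# Splitting a neuron with mixing parameter alpha: for every alpha the wide
# network computes the same function, so the loss is constant along this
# direction and the embedded critical point is degenerate.
d_in, m, n = 5, 4, 30
X = rng.normal(size=(d_in, n))
Y = rng.normal(size=(1, n))
W1 = rng.normal(size=(m, d_in))
a = rng.normal(size=(1, m))

def wide_loss(alpha):
    W1_wide = np.vstack([W1, W1[0:1, :]])
    a_wide = np.hstack([a, np.zeros((1, 1))])
    a_wide[0, 0] = alpha * a[0, 0]
    a_wide[0, m] = (1 - alpha) * a[0, 0]
    return np.mean((a_wide @ relu(W1_wide @ X) - Y) ** 2)

losses = [wide_loss(alpha) for alpha in np.linspace(-1.0, 2.0, 7)]
print("loss constant along the splitting direction:", np.allclose(losses, losses[0]))
```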
arXiv Detail & Related papers (2021-05-30T15:32:32Z)
- Statistical Mechanics of Deep Linear Neural Networks: The Back-Propagating Renormalization Group [4.56877715768796]
We study the statistical mechanics of learning in Deep Linear Neural Networks (DLNNs) in which the input-output function of an individual unit is linear.
We solve exactly the network properties following supervised learning using an equilibrium Gibbs distribution in the weight space.
Our numerical simulations reveal that despite the nonlinearity, the predictions of our theory are largely shared by ReLU networks with modest depth.
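As a loose operational illustration of sampling weights from an equilibrium Gibbs distribution (a generic Langevin sketch on a two-weight toy model, not the paper's exact solution): Langevin dynamics on the training loss samples weights with density proportional to exp(-loss / T).

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny deep linear model w2 * w1 * x on scalar data; Langevin dynamics
# samples weights from the Gibbs distribution p(w) ~ exp(-L(w) / T).
X = rng.normal(size=50)
Y = 1.5 * X + 0.1 * rng.normal(size=50)
T, step, n_steps = 0.05, 1e-3, 20000

def loss_and_grad(w1, w2):
    r = w2 * w1 * X - Y
    return np.mean(r**2), np.mean(2 * r * w2 * X), np.mean(2 * r * w1 * X)

w1, w2, samples = 0.5, 0.5, []
for t in range(n_steps):
    _, g1, g2 = loss_and_grad(w1, w2)
    w1 += -step * g1 + np.sqrt(2 * step * T) * rng.normal()
    w2 += -step * g2 + np.sqrt(2 * step * T) * rng.normal()
    if t > n_steps // 2:
        samples.append(w1 * w2)

# At equilibrium the end-to-end slope w1 * w2 concentrates near the teacher slope 1.5.
print("mean end-to-end slope under the Gibbs distribution:", np.mean(samples))
```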
arXiv Detail & Related papers (2020-12-07T20:08:31Z)
- Learning Connectivity of Neural Networks from a Topological Perspective [80.35103711638548]
We propose a topological perspective to represent a network into a complete graph for analysis.
By assigning learnable parameters to the edges which reflect the magnitude of connections, the learning process can be performed in a differentiable manner.
This learning process is compatible with existing networks and owns adaptability to larger search spaces and different tasks.
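A small hypothetical sketch of the connectivity-as-learnable-edges idea: each node of a complete DAG aggregates all earlier nodes' outputs through sigmoid-gated, differentiable edge scores. All names and shapes here are my own, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Complete DAG over `n_nodes` computation nodes: node i aggregates every
# earlier node's output, weighted by a learnable (differentiable) edge score.
n_nodes, d = 4, 16
edge_logits = rng.normal(size=(n_nodes, n_nodes))      # learnable connectivity
node_weights = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_nodes)]

def forward(x):
    outputs = [x]                                       # node 0 holds the input
    for i in range(1, n_nodes + 1):
        # Differentiable aggregation over all predecessor nodes.
        agg = sum(sigmoid(edge_logits[i - 1, j]) * outputs[j] for j in range(i))
        outputs.append(relu(node_weights[i - 1] @ agg))
    return outputs[-1]

print("output shape:", forward(rng.normal(size=d)).shape)
```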
arXiv Detail & Related papers (2020-08-19T04:53:31Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
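A rough sketch of this kind of Hessian-norm diagnostic (a generic estimate via finite-difference Hessian-vector products and power iteration, not the authors' exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy loss with an analytic gradient; the Hessian norm is estimated using
# gradient calls only, which is what makes such diagnostics cheap to track.
n, d = 100, 20
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def grad(w):
    return 2 * X.T @ (X @ w - y) / n + 0.4 * w**3   # grad of MSE + 0.1 * sum(w^4)

def hessian_norm(w, iters=50, eps=1e-4):
    """Power iteration on finite-difference Hessian-vector products."""
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        Hv = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)
        v = Hv / np.linalg.norm(Hv)
    Hv = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)
    return float(v @ Hv)   # Rayleigh quotient, approx. the largest eigenvalue

w = rng.normal(size=d)
print("estimated Hessian spectral norm at w:", hessian_norm(w))
```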
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- Robust Pruning at Initialization [61.30574156442608]
There is a growing need for smaller, energy-efficient neural networks that make it possible to run machine learning applications on devices with limited computational resources.
For Deep NNs, such procedures remain unsatisfactory as the resulting pruned networks can be difficult to train and, for instance, they do not prevent one layer from being fully pruned.
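A minimal toy illustration of the failure mode mentioned above (my own example, not the paper's method): global magnitude pruning at initialization can remove essentially every weight of a layer whose initialization scale is small, cutting off any signal through that layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three layers initialized at different scales (hypothetical, but such scale
# mismatches do occur); prune 90% of weights globally by magnitude.
layers = {
    "layer1": rng.normal(size=(64, 64)) * 1.0,
    "layer2": rng.normal(size=(64, 64)) * 0.01,   # much smaller init scale
    "layer3": rng.normal(size=(64, 64)) * 1.0,
}
all_weights = np.concatenate([np.abs(W).ravel() for W in layers.values()])
threshold = np.quantile(all_weights, 0.90)        # keep only the top 10%

for name, W in layers.items():
    kept = np.mean(np.abs(W) > threshold)
    print(f"{name}: fraction of weights kept = {kept:.3f}")
# layer2 ends up (almost) fully pruned, which blocks any signal through it.
```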
arXiv Detail & Related papers (2020-02-19T17:09:50Z)