Statistical Mechanics of Deep Linear Neural Networks: The
Back-Propagating Renormalization Group
- URL: http://arxiv.org/abs/2012.04030v1
- Date: Mon, 7 Dec 2020 20:08:31 GMT
- Title: Statistical Mechanics of Deep Linear Neural Networks: The
Back-Propagating Renormalization Group
- Authors: Qianyi Li, Haim Sompolinsky
- Abstract summary: We study the statistical mechanics of learning in Deep Linear Neural Networks (DLNNs) in which the input-output function of an individual unit is linear.
We solve exactly the network properties following supervised learning using an equilibrium Gibbs distribution in the weight space.
Our numerical simulations reveal that despite the nonlinearity, the predictions of our theory are largely shared by ReLU networks with modest depth.
- Score: 4.56877715768796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of deep learning in many real-world tasks has triggered an effort
to theoretically understand the power and limitations of deep learning in
training and generalization of complex tasks, so far with limited progress. In
this work, we study the statistical mechanics of learning in Deep Linear Neural
Networks (DLNNs) in which the input-output function of an individual unit is
linear. Despite the linearity of the units, learning in DLNNs is highly
nonlinear, hence studying its properties reveals some of the essential features
of nonlinear Deep Neural Networks (DNNs). We solve exactly the network
properties following supervised learning using an equilibrium Gibbs
distribution in the weight space. To do this, we introduce the Back-Propagating
Renormalization Group (BPRG) which allows for the incremental integration of
the network weights layer by layer from the network output layer and
progressing backward. This procedure allows us to evaluate important network
properties such as its generalization error, the role of network width and
depth, the impact of the size of the training set, and the effects of weight
regularization and learning stochasticity. Furthermore, by performing partial
integration of layers, BPRG allows us to compute the emergent properties of the
neural representations across the different hidden layers. We have proposed a
heuristic extension of the BPRG to nonlinear DNNs with rectified linear units
(ReLU). Surprisingly, our numerical simulations reveal that despite the
nonlinearity, the predictions of our theory are largely shared by ReLU networks
with modest depth, in a wide regime of parameters. Our work is the first exact
statistical mechanical study of learning in a family of Deep Neural Networks,
and the first development of the Renormalization Group approach to the weight
space of these systems.
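As a concrete picture of the setting analyzed here (not of the BPRG calculation itself), the sketch below builds a small deep linear network and samples its weights with Langevin dynamics, whose stationary law (in the small step-size limit) is a Gibbs distribution over weight space with the training error plus an L2 penalty as the energy; the temperature stands in for learning stochasticity and the penalty for weight regularization. All sizes, rates, and variable names are illustrative choices, not values from the paper.

```python
# Minimal sketch (not the paper's BPRG calculation): sample the weights of a
# deep *linear* network from a Gibbs distribution over weight space with
# Langevin dynamics.  Energy = squared training error/(2N) + (lam/2)*||W||^2,
# so for small lr the stationary distribution is P(W) ∝ exp(-Energy(W)/T).
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised task: linear teacher with noisy labels.
N, d_in, d_out = 50, 20, 1
X = rng.standard_normal((N, d_in))
w_teacher = rng.standard_normal((d_in, d_out))
Y = X @ w_teacher + 0.1 * rng.standard_normal((N, d_out))

# Deep linear network: f(x) = x W_1 ... W_L (widths are arbitrary choices).
widths = [d_in, 30, 30, d_out]
Ws = [rng.standard_normal((widths[l], widths[l + 1])) / np.sqrt(widths[l])
      for l in range(len(widths) - 1)]

lam, T, lr, steps = 1e-2, 1e-3, 1e-3, 5000   # regularization, temperature, step size

def forward(Ws, X):
    H = X
    for W in Ws:
        H = H @ W
    return H

for t in range(steps):
    err = forward(Ws, X) - Y                      # (N, d_out)
    grads = []
    for l in range(len(Ws)):
        left = X
        for W in Ws[:l]:
            left = left @ W                       # activations entering layer l
        right = np.eye(widths[l + 1])
        for W in Ws[l + 1:]:
            right = right @ W                     # map from layer l+1 output to network output
        grads.append(left.T @ err @ right.T / N + lam * Ws[l])
    # Langevin update: gradient step plus Gaussian noise of variance 2*lr*T.
    for W, g in zip(Ws, grads):
        W -= lr * g
        W += np.sqrt(2 * lr * T) * rng.standard_normal(W.shape)

print("train MSE:", float(np.mean((forward(Ws, X) - Y) ** 2)))
```

Integrating the weights of such a Gibbs ensemble out layer by layer, starting from the output layer and moving backward, is what the paper's BPRG procedure does analytically.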
Related papers
- Theoretical characterisation of the Gauss-Newton conditioning in Neural Networks [5.851101657703105]
We take a first step towards theoretically characterizing the conditioning of the Gauss-Newton (GN) matrix in neural networks.
We establish tight bounds on the condition number of the GN in deep linear networks of arbitrary depth and width.
We expand the analysis to further architectural components, such as residual connections and convolutional layers.
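As a numerical companion to this entry (my own construction, not the paper's analysis), the sketch below assembles the Gauss-Newton matrix of a two-layer linear network from per-sample Jacobians and reports the condition number of its nonzero spectrum; all dimensions and tolerances are arbitrary choices.

```python
# Minimal sketch: build the Gauss-Newton matrix G = sum_n J_n^T J_n of a
# two-layer *linear* network f(x) = W2 @ W1 @ x and inspect its conditioning.
# (Illustrative only; the referenced paper derives analytical bounds.)
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, N = 10, 20, 5, 100
W1 = rng.standard_normal((d_hid, d_in)) / np.sqrt(d_in)
W2 = rng.standard_normal((d_out, d_hid)) / np.sqrt(d_hid)
X = rng.standard_normal((N, d_in))

P = W1.size + W2.size                 # total number of parameters
G = np.zeros((P, P))
for x in X:
    # Per-sample Jacobian of f(x) w.r.t. (vec(W1), vec(W2)), column-major vec:
    #   f = W2 W1 x = (x^T ⊗ W2) vec(W1) = ((W1 x)^T ⊗ I) vec(W2)
    J1 = np.kron(x[None, :], W2)                      # (d_out, d_hid*d_in)
    J2 = np.kron((W1 @ x)[None, :], np.eye(d_out))    # (d_out, d_out*d_hid)
    J = np.hstack([J1, J2])                           # (d_out, P)
    G += J.T @ J

# The GN matrix is rank-deficient here, so look at the condition number of
# the nonzero part of its spectrum.
eig = np.linalg.eigvalsh(G)
nonzero = eig[eig > 1e-10 * eig.max()]
print("rank:", len(nonzero), "of", P)
print("condition number on nonzero spectrum:", nonzero.max() / nonzero.min())
```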
arXiv Detail & Related papers (2024-11-04T14:56:48Z)
- Low-Rank Learning by Design: the Role of Network Architecture and Activation Linearity in Gradient Rank Collapse [14.817633094318253]
We study how architectural choices and the structure of the data affect gradient rank bounds in deep neural networks (DNNs).
Our theoretical analysis provides these bounds for training fully-connected, recurrent, and convolutional neural networks.
We also demonstrate, both theoretically and empirically, how design choices like activation function linearity, bottleneck layer introduction, convolutional stride, and sequence truncation influence these bounds.
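A toy way to see the flavor of such bounds (my own example, not the paper's general result): for a two-layer network with MSE loss, the first-layer gradient factors through the output weights, so a linear activation caps its rank at the output dimension, while a ReLU mask can lift that cap. The sketch below computes both gradients in closed form and prints their numerical ranks; sizes and tolerances are arbitrary.

```python
# Toy illustration (my construction, not the paper's bounds): the gradient of
# the MSE loss w.r.t. the first-layer weights of a 2-layer net is
#   dL/dW1 = X^T [ (E @ W2^T) * phi'(X @ W1) ] / N,   E = phi(X @ W1) @ W2 - Y.
# With a linear activation the elementwise factor disappears and the rank is
# capped by the output dimension; a nonlinearity can lift that cap.
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_hid, d_out = 200, 64, 64, 3
X = rng.standard_normal((N, d_in))
Y = rng.standard_normal((N, d_out))
W1 = rng.standard_normal((d_in, d_hid)) / np.sqrt(d_in)
W2 = rng.standard_normal((d_hid, d_out)) / np.sqrt(d_hid)

def grad_W1(phi, dphi):
    Z = X @ W1
    E = phi(Z) @ W2 - Y
    return X.T @ ((E @ W2.T) * dphi(Z)) / N

def num_rank(M, tol=1e-8):
    s = np.linalg.svd(M, compute_uv=False)
    return int((s > tol * s[0]).sum())

g_linear = grad_W1(lambda z: z, lambda z: np.ones_like(z))
g_relu   = grad_W1(lambda z: np.maximum(z, 0.0), lambda z: (z > 0).astype(float))

print("rank of dL/dW1, linear activation:", num_rank(g_linear), "(<= d_out =", d_out, ")")
print("rank of dL/dW1, ReLU activation:  ", num_rank(g_relu))
```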
arXiv Detail & Related papers (2024-02-09T19:28:02Z)
- Understanding Deep Neural Networks via Linear Separability of Hidden Layers [68.23950220548417]
We first propose Minkowski difference based linear separability measures (MD-LSMs) to evaluate the linear separability degree of two point sets.
We demonstrate that there is a synchronicity between the linear separability degree of hidden layer outputs and the network training performance.
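The MD-LSM measure itself is specific to the paper; as a rough, clearly labeled stand-in, the sketch below scores the linear separability of two classes at each layer of a random ReLU network with a least-squares linear probe. The data, widths, and probe are all illustrative choices.

```python
# Rough stand-in for the idea (NOT the paper's MD-LSM measure): score how
# linearly separable two classes are at each layer of a random ReLU network,
# using a least-squares linear probe as the separability proxy.
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes in the input space.
N, d_in = 300, 10
X = np.vstack([rng.standard_normal((N, d_in)) + 1.5,
               rng.standard_normal((N, d_in)) - 1.5])
y = np.concatenate([np.ones(N), -np.ones(N)])

def probe_accuracy(H, y):
    """Fit w by least squares on [H, 1] and report sign-agreement accuracy."""
    A = np.hstack([H, np.ones((H.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean(np.sign(A @ w) == y))

# Random (untrained) ReLU network; report probe accuracy layer by layer.
widths = [d_in, 50, 50, 50]
H = X
print("input layer probe accuracy:", probe_accuracy(H, y))
for l in range(len(widths) - 1):
    W = rng.standard_normal((widths[l], widths[l + 1])) / np.sqrt(widths[l])
    H = np.maximum(H @ W, 0.0)
    print(f"hidden layer {l + 1} probe accuracy:", probe_accuracy(H, y))
```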
arXiv Detail & Related papers (2023-07-26T05:29:29Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Globally Gated Deep Linear Networks [3.04585143845864]
We introduce Globally Gated Deep Linear Networks (GGDLNs) where gating units are shared among all processing units in each layer.
We derive exact equations for the generalization properties in these networks in the finite-width thermodynamic limit.
Our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width.
arXiv Detail & Related papers (2022-10-31T16:21:56Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- Characterizing Learning Dynamics of Deep Neural Networks via Complex Networks [1.0869257688521987]
Complex Network Theory (CNT) represents Deep Neural Networks (DNNs) as directed weighted graphs to study them as dynamical systems.
We introduce metrics for nodes/neurons and layers, namely Nodes Strength and Layers Fluctuation.
Our framework distills trends in the learning dynamics and separates low from high accurate networks.
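For intuition only: node strength is the standard weighted-graph quantity (the summed weights of the edges incident to a node), and the sketch below computes it for the hidden neurons of a small fully connected network; the paper's precise Nodes Strength and Layers Fluctuation definitions may differ, so treat this as an assumed proxy.

```python
# Sketch of the Complex Network Theory view: treat a fully connected DNN as a
# weighted directed graph and compute each neuron's strength (here: summed
# incoming plus outgoing weights).  The paper's exact "Nodes Strength" /
# "Layers Fluctuation" definitions may differ; this is the standard
# graph-theoretic quantity as an illustration.
import numpy as np

rng = np.random.default_rng(0)
widths = [8, 16, 16, 4]                       # toy fully connected architecture
Ws = [rng.standard_normal((widths[l], widths[l + 1])) / np.sqrt(widths[l])
      for l in range(len(widths) - 1)]

for l in range(1, len(widths) - 1):           # hidden layers only
    incoming = Ws[l - 1].sum(axis=0)          # (widths[l],) summed input weights
    outgoing = Ws[l].sum(axis=1)              # (widths[l],) summed output weights
    strength = incoming + outgoing            # node strength per neuron
    print(f"layer {l}: mean strength {strength.mean():+.3f}, "
          f"std {strength.std():.3f}")
```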
arXiv Detail & Related papers (2021-10-06T10:03:32Z)
- A Weight Initialization Based on the Linear Product Structure for Neural Networks [0.0]
We study neural networks from a nonlinear point of view and propose a novel weight initialization strategy that is based on the linear product structure (LPS) of neural networks.
The proposed strategy is derived from the approximation of activation functions by using theories of numerical algebra to guarantee finding all the local minima.
arXiv Detail & Related papers (2021-09-01T00:18:59Z)
- A neural anisotropic view of underspecification in deep learning [60.119023683371736]
We show that the way neural networks handle the underspecification of problems is highly dependent on the data representation.
Our results highlight that understanding the architectural inductive bias in deep learning is fundamental to address the fairness, robustness, and generalization of these systems.
arXiv Detail & Related papers (2021-04-29T14:31:09Z)
- How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks [80.55378250013496]
We study how neural networks trained by gradient descent extrapolate what they learn outside the support of the training distribution.
Graph Neural Networks (GNNs) have shown some success in more complex tasks.
arXiv Detail & Related papers (2020-09-24T17:48:59Z)
- Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z)