Weak Correlations as the Underlying Principle for Linearization of
Gradient-Based Learning Systems
- URL: http://arxiv.org/abs/2401.04013v1
- Date: Mon, 8 Jan 2024 16:44:23 GMT
- Title: Weak Correlations as the Underlying Principle for Linearization of
Gradient-Based Learning Systems
- Authors: Ori Shem-Ur, Yaron Oz
- Abstract summary: This paper delves into gradient descent-based learning algorithms that display a linear structure in their parameter dynamics.
We establish that this apparent linearity arises from weak correlations between the first and higher-order derivatives of the hypothesis function.
Exploiting the relationship between linearity and weak correlations, we derive a bound on the deviations from linearity observed along the training trajectory of stochastic gradient descent.
- Score: 1.0878040851638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning models, such as wide neural networks, can be conceptualized as
nonlinear dynamical physical systems characterized by a multitude of
interacting degrees of freedom. In the infinite limit, such systems tend to
exhibit simplified dynamics. This paper delves into gradient descent-based
learning algorithms that display a linear structure in their parameter
dynamics, reminiscent of the neural tangent kernel. We establish that this
apparent linearity arises from weak correlations between the first and
higher-order derivatives of the hypothesis function with respect to the
parameters, taken around their initial values. This insight suggests that
these weak correlations could be the underlying reason for the observed
linearization in such systems. As a case in point, we showcase this
weak-correlation structure in neural networks in the large-width limit.
Exploiting the relationship between
linearity and weak correlations, we derive a bound on deviations from linearity
observed during the training trajectory of stochastic gradient descent. To
facilitate our proof, we introduce a novel method to characterise the
asymptotic behavior of random tensors.
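To make the linearization statement concrete, here is a minimal numerical sketch, not taken from the paper: it trains a small two-layer tanh network with full-batch gradient descent and tracks the gap between its outputs and the model linearized around the initial parameters, f_lin(x; theta) = f(x; theta_0) + grad_theta f(x; theta_0) . (theta - theta_0). The architecture, widths, learning rate, and toy regression target are illustrative assumptions; under the abstract's reasoning, the gap is expected to shrink as the width grows.
```python
# Minimal sketch (illustrative assumptions, not code from the paper):
# train a two-layer tanh network with full-batch gradient descent and
# measure how far its outputs drift from the model linearized at init,
#   f_lin(x; theta) = f(x; theta_0) + grad_theta f(x; theta_0) . (theta - theta_0)
import numpy as np

rng = np.random.default_rng(0)


def init_params(width, d_in):
    """NTK-style parameterization: output scaled by 1/sqrt(width)."""
    W = rng.standard_normal((width, d_in))
    a = rng.standard_normal(width)
    return W, a


def forward(W, a, X):
    """f(x) = (1/sqrt(n)) * a . tanh(W x)."""
    return np.tanh(X @ W.T) @ a / np.sqrt(W.shape[0])


def jacobian(W, a, X):
    """Per-example gradients of f w.r.t. (W, a), flattened into rows."""
    n = W.shape[0]
    H = np.tanh(X @ W.T)                                       # (N, n)
    dH = 1.0 - H ** 2                                          # tanh'
    dW = (dH * a)[:, :, None] * X[:, None, :] / np.sqrt(n)     # (N, n, d)
    da = H / np.sqrt(n)                                        # (N, n)
    return np.concatenate([dW.reshape(len(X), -1), da], axis=1)


def deviation_from_linearity(width, d_in=5, n_samples=20, lr=0.1, steps=300):
    X = rng.standard_normal((n_samples, d_in))
    y = np.sin(X[:, 0])                                        # toy regression target
    W, a = init_params(width, d_in)
    W0, a0 = W.copy(), a.copy()
    f0, J0 = forward(W0, a0, X), jacobian(W0, a0, X)
    dev = 0.0
    for _ in range(steps):
        # gradient of (1/(2N)) * ||f - y||^2 w.r.t. the flattened parameters
        r = forward(W, a, X) - y
        g = jacobian(W, a, X).T @ r
        gW, ga = g[: W.size].reshape(W.shape), g[W.size:]
        W -= lr / n_samples * gW
        a -= lr / n_samples * ga
        # compare the trained model with its linearization at initialization
        dtheta = np.concatenate([(W - W0).ravel(), a - a0])
        f_lin = f0 + J0 @ dtheta
        dev = np.max(np.abs(forward(W, a, X) - f_lin))
    return dev

for width in (64, 256, 1024, 4096):
    print(width, deviation_from_linearity(width))
```
Running the loop over increasing widths gives a rough empirical counterpart to the paper's bound on deviations from linearity along the training trajectory; it does not, of course, exhibit the weak-correlation mechanism itself, only its expected consequence.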
Related papers
- Minimum-Norm Interpolation Under Covariate Shift [14.863831433459902]
In-distribution research on high-dimensional linear regression has led to the identification of a phenomenon known as benign overfitting.
We prove the first non-asymptotic excess risk bounds for benignly-overfit linear interpolators in the transfer learning setting.
arXiv Detail & Related papers (2024-03-31T01:41:57Z)
- Feature learning as alignment: a structural property of gradient descent in non-linear neural networks [13.032185349152492]
We show that the correlation measured by the neural feature ansatz (NFA) strengthens during training.
We establish that the alignment is driven by the interaction of weight changes induced by SGD with the pre-activation features.
We prove that the derivative alignment occurs almost surely in specific high-dimensional settings.
arXiv Detail & Related papers (2024-02-07T21:31:53Z)
- Learning Discretized Neural Networks under Ricci Flow [51.36292559262042]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
DNNs suffer from either infinite or zero gradients due to the non-differentiable discrete function during training.
arXiv Detail & Related papers (2023-02-07T10:51:53Z)
- Identifiability and Asymptotics in Learning Homogeneous Linear ODE Systems from Discrete Observations [114.17826109037048]
Ordinary Differential Equations (ODEs) have recently gained a lot of attention in machine learning.
Theoretical aspects, e.g., identifiability and the properties of statistical estimation, are still obscure.
This paper derives a sufficient condition for the identifiability of homogeneous linear ODE systems from a sequence of equally-spaced error-free observations sampled from a single trajectory.
arXiv Detail & Related papers (2022-10-12T06:46:38Z)
- Phenomenology of Double Descent in Finite-Width Neural Networks [29.119232922018732]
Double descent delineates the behaviour of models depending on the regime they belong to.
We use influence functions to derive suitable expressions of the population loss and its lower bound.
Building on our analysis, we investigate how the loss function affects double descent.
arXiv Detail & Related papers (2022-03-14T17:39:49Z)
- The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z)
- Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
- The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a double descent curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z)
- Provably Efficient Neural Estimation of Structural Equation Model: An Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized structural equation models (SEMs).
We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using gradient descent.
For the first time, we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
arXiv Detail & Related papers (2020-07-02T17:55:47Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)