An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models
- URL: http://arxiv.org/abs/2505.08915v1
- Date: Tue, 13 May 2025 19:20:19 GMT
- Title: An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models
- Authors: Jialin Mao, Itay Griniasty, Yan Sun, Mark K. Transtrum, James P. Sethna, Pratik Chaudhari
- Abstract summary: Recent experiments have shown that training trajectories of multiple deep neural networks evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent.
- Score: 18.99511760351873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent experiments have shown that training trajectories of multiple deep neural networks with different architectures, optimization algorithms, hyper-parameter settings, and regularization methods evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold in the space of probability distributions. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show, using tools in dynamical systems theory, that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent. By analytically computing and bounding the contributions of these quantities, we characterize phase boundaries of the region where hyper-ribbons are to be expected. We also extend our analysis to kernel machines and linear models that are trained with stochastic gradient descent.
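The short sketch below (illustrative only, not the authors' code) sets up the linear-model scenario the abstract describes: inputs whose correlation matrix has power-law eigenvalue decay, a ground-truth output that is large relative to the initial weights, and a fixed number of gradient-descent steps. The decay exponent, scale ratio, learning rate, and step count are assumed values chosen for illustration; the trajectory of the model's predictions is then projected with PCA (via an SVD) to check how few components capture most of its variance, the signature of a hyper-ribbon-like manifold.

```python
# Illustrative sketch only: a linear model trained with gradient descent, used to
# probe the three quantities named in the abstract. All numerical choices below
# (decay exponent, initialization scale, learning rate, step count) are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, steps, lr = 500, 50, 200, 0.05

# (i) Inputs whose correlation matrix has eigenvalues decaying like k^(-alpha).
alpha = 2.0
eigvals = np.arange(1, d + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((n, d)) * np.sqrt(eigvals)   # column k has variance eigvals[k]

# (ii) Ground-truth outputs that are large relative to the initial weights.
w_star = rng.standard_normal(d)
y = X @ w_star
w = 0.01 * rng.standard_normal(d)                    # small initialization

# (iii) Gradient descent on the squared loss; record the prediction trajectory.
trajectory = []
for _ in range(steps):
    w -= lr * X.T @ (X @ w - y) / n
    trajectory.append(X @ w)                         # model outputs on the training set

# PCA (via SVD) of the trajectory: if it lies on a low-dimensional,
# hyper-ribbon-like manifold, a handful of components explain nearly all variance.
T = np.array(trajectory)
T -= T.mean(axis=0)
sing = np.linalg.svd(T, compute_uv=False)
explained = sing**2 / np.sum(sing**2)
k99 = int(np.searchsorted(np.cumsum(explained), 0.99)) + 1
print("variance explained by first 3 components:", round(float(explained[:3].sum()), 4))
print("components needed for 99% of the trajectory variance:", k99)
```

With faster eigenvalue decay (larger alpha), fewer steps, or a different output-to-initialization scale, the number of components needed should change, which is the kind of phase behavior the paper characterizes analytically.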
Related papers
- Random Matrix Theory for Deep Learning: Beyond Eigenvalues of Linear Models [51.85815025140659]
Modern Machine Learning (ML) and Deep Neural Networks (DNNs) often operate on high-dimensional data. In particular, the proportional regime where the data dimension, sample size, and number of model parameters are all large gives rise to novel and sometimes counterintuitive behaviors. This paper extends traditional Random Matrix Theory (RMT) beyond eigenvalue-based analysis of linear models to address the challenges posed by nonlinear ML models.
arXiv Detail & Related papers (2025-06-16T06:54:08Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Unifying Low Dimensional Observations in Deep Learning Through the Deep Linear Unconstrained Feature Model [0.0]
We study low-dimensional structure in the weights, Hessians, gradients, and feature vectors of deep neural networks.
We show how they can be unified within a generalized unconstrained feature model.
arXiv Detail & Related papers (2024-04-09T08:17:32Z) - Weak Correlations as the Underlying Principle for Linearization of Gradient-Based Learning Systems [1.0878040851638]
This paper delves into gradient descent-based learning algorithms that display a linear structure in their parameter dynamics. We establish that this apparent linearity arises from weak correlations between the first and higher-order derivatives of the hypothesis function.
Exploiting the relationship between linearity and weak correlations, we derive a bound on deviations from linearity observed during the training trajectory of gradient descent.
arXiv Detail & Related papers (2024-01-08T16:44:23Z) - Convergence Analysis for Learning Orthonormal Deep Linear Neural Networks [27.29463801531576]
We provide convergence analysis for training orthonormal deep linear neural networks.
Our results shed light on how increasing the number of hidden layers can impact the convergence speed.
arXiv Detail & Related papers (2023-11-24T18:46:54Z) - From Complexity to Clarity: Analytical Expressions of Deep Neural Network Weights via Clifford's Geometric Algebra and Convexity [54.01594785269913]
We show that optimal weights of deep ReLU neural networks are given by the wedge product of training samples when trained with standard regularized loss.
The training problem reduces to convex optimization over wedge product features, which encode the geometric structure of the training dataset.
arXiv Detail & Related papers (2023-09-28T15:19:30Z) - Autoencoders for discovering manifold dimension and coordinates in data from complex dynamical systems [0.0]
The autoencoder framework combines implicit regularization with internal linear layers and $L_2$ regularization (weight decay).
We show that this framework can be naturally extended for applications of state-space modeling and forecasting.
arXiv Detail & Related papers (2023-05-01T21:14:47Z) - Capturing dynamical correlations using implicit neural representations [85.66456606776552]
We develop an artificial intelligence framework which combines a neural network trained to mimic simulated data from a model Hamiltonian with automatic differentiation to recover unknown parameters from experimental data.
In doing so, we illustrate the ability to build and train a differentiable model only once, which then can be applied in real-time to multi-dimensional scattering data.
arXiv Detail & Related papers (2023-04-08T07:55:36Z) - Training invariances and the low-rank phenomenon: beyond linear networks [44.02161831977037]
We show that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices. We extend this to nonlinear ReLU-activated feedforward networks; this is the first time a low-rank phenomenon has been proven rigorously for such networks.
Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
arXiv Detail & Related papers (2022-01-28T07:31:19Z) - Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
A convergence rate analysis of the mean field Langevin dynamics is presented. The proximal Gibbs distribution $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization.
arXiv Detail & Related papers (2022-01-25T17:13:56Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this kernel-to-rich transition empirically for more complex matrix factorization models and multilayer non-linear networks (a minimal numerical sketch of the same transition appears after this list).
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
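To accompany the "Kernel and Rich Regimes in Overparametrized Models" entry above, here is a minimal sketch, assuming the diagonal linear network reparameterization w = u*u - v*v that is commonly used to study this transition, of how the initialization scale alpha shifts the solution of an under-determined sparse regression problem from a dense, minimum-norm-like interpolator (large alpha, kernel regime) toward a sparse one (small alpha, rich regime). The problem sizes, learning rate, and step count are illustrative assumptions, not values from any listed paper.

```python
# Illustrative sketch only: kernel vs. rich regimes via a diagonal linear network.
# The reparameterization w = u*u - v*v and all numerical choices are assumptions
# made for this example, not the setup of any specific paper listed above.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 40                                    # under-determined: fewer samples than features
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = 1.0                                 # sparse ground truth
y = X @ w_true

def train_diagonal_net(alpha, lr=2e-3, steps=50_000):
    """Gradient descent on the squared loss for w parameterized as u*u - v*v."""
    u = alpha * np.ones(d)
    v = alpha * np.ones(d)
    for _ in range(steps):
        w = u * u - v * v
        g = X.T @ (X @ w - y) / n                # gradient of the loss w.r.t. w
        u -= lr * 2 * u * g                      # chain rule: dw/du = 2u
        v += lr * 2 * v * g                      # chain rule: dw/dv = -2v
    return u * u - v * v

for alpha in (2.0, 0.01):                        # large init (kernel-like) vs. small init (rich)
    w = train_diagonal_net(alpha)
    on_support = np.abs(w[:3]).sum() / np.abs(w).sum()
    print(f"alpha={alpha}: train MSE={np.mean((X @ w - y) ** 2):.2e}, "
          f"fraction of |w| on the true support={on_support:.2f}")
```

Both runs are expected to fit the training data, but the small-initialization run should concentrate most of its weight mass on the three true coordinates (an l1-like "rich" bias), while the large-initialization run spreads it across all features.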
This list is automatically generated from the titles and abstracts of the papers on this site.