Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations
- URL: http://arxiv.org/abs/2602.06208v1
- Date: Thu, 05 Feb 2026 21:38:17 GMT
- Title: Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations
- Authors: Alec S. Xu, Can Yaras, Matthew Asato, Qing Qu, Laura Balzano
- Abstract summary: Recent empirical evidence has demonstrated that the training dynamics of large-scale deep neural networks occur within low-dimensional subspaces. This paper analyzes the learning dynamics of multi-layer perceptrons (MLPs) under gradient descent (GD). We demonstrate that the weight dynamics concentrate within invariant low-dimensional subspaces throughout training.
- Score: 14.896223045496301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent empirical evidence has demonstrated that the training dynamics of large-scale deep neural networks occur within low-dimensional subspaces. While this has inspired new research into low-rank training, compression, and adaptation, theoretical justification for these dynamics in nonlinear networks remains limited. To address this gap, this paper analyzes the learning dynamics of multi-layer perceptrons (MLPs) under gradient descent (GD). We demonstrate that the weight dynamics concentrate within invariant low-dimensional subspaces throughout training. Theoretically, we precisely characterize these invariant subspaces for two-layer networks with smooth nonlinear activations, providing insight into their emergence. Experimentally, we validate that this phenomenon extends beyond our theoretical assumptions. Leveraging these insights, we empirically show there exists a low-rank MLP parameterization that, when initialized within the appropriate subspaces, matches the classification performance of fully-parameterized counterparts on a variety of classification tasks.
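One way to probe the abstract's claim that weight dynamics concentrate in low-dimensional subspaces is to train a small two-layer MLP with a smooth activation and inspect the singular values of the first-layer weight change. The sketch below is illustrative only and is not the authors' code or experimental setup: the synthetic data, the tanh activation, the squared loss, and the names `train_two_layer_mlp` and `cumulative_energy` are all assumptions made for this example.

```python
import numpy as np

def train_two_layer_mlp(X, Y, hidden=64, lr=0.05, steps=200, seed=0):
    """Full-batch GD on a two-layer tanh MLP with squared loss.

    Returns the first-layer weights at initialization and after training.
    All hyperparameters are arbitrary choices for this sketch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = Y.shape[1]
    W1 = rng.normal(scale=1.0 / np.sqrt(d), size=(d, hidden))
    W2 = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, k))
    W1_init = W1.copy()
    for _ in range(steps):
        H = np.tanh(X @ W1)                  # hidden activations, shape (n, hidden)
        err = H @ W2 - Y                     # residual, shape (n, k)
        grad_W2 = H.T @ err / n
        grad_H = err @ W2.T                  # backprop through the output layer
        grad_W1 = X.T @ (grad_H * (1.0 - H ** 2)) / n  # tanh'(z) = 1 - tanh(z)^2
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1_init, W1

def cumulative_energy(delta):
    """Cumulative fraction of squared singular values of a weight update."""
    s = np.linalg.svd(delta, compute_uv=False)
    energy = s ** 2
    return np.cumsum(energy) / energy.sum()

# Synthetic 4-class problem (an assumption): labels depend on a random
# 4-dimensional linear projection of 32-dimensional Gaussian inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 32))
labels = np.argmax(X @ rng.normal(size=(32, 4)), axis=1)
Y = np.eye(4)[labels]                        # one-hot targets

W1_init, W1_final = train_two_layer_mlp(X, Y)
energy = cumulative_energy(W1_final - W1_init)
print("fraction of update energy in the top-4 directions:", energy[3])
```

If the update is concentrated in a few directions, `energy` rises toward 1 within a small number of singular values. How strongly this happens in the toy setting depends on the assumed data and hyperparameters, so the printed number should not be read as a reproduction of the paper's results.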
Related papers
- Gradient flow for deep equilibrium single-index models [32.2015869030351]
Deep equilibrium models (DEQs) have emerged as a powerful paradigm for training infinitely deep weight-tied neural networks. We rigorously study the gradient descent dynamics for DEQs in the simple setting of linear models and single-index models. We then prove linear convergence of gradient descent to a global minimizer for linear DEQs and deep equilibrium single-index models.
arXiv Detail & Related papers (2025-11-21T06:14:41Z) - VIKING: Deep variational inference with stochastic projections [48.946143517489496]
Variational mean field approximations tend to struggle with contemporary overparametrized deep neural networks. We propose a simple variational family that considers two independent linear subspaces of the parameter space. This allows us to build a fully-correlated approximate posterior reflecting the overparametrization.
arXiv Detail & Related papers (2025-10-27T15:38:35Z) - An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models [18.99511760351873]
Recent experiments have shown that training trajectories of multiple deep neural networks evolve on a remarkably low-dimensional "hyper-ribbon-like" manifold. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent.
arXiv Detail & Related papers (2025-05-13T19:20:19Z) - An Overview of Low-Rank Structures in the Training and Adaptation of Large Models [52.67110072923365]
Recent research has uncovered a widespread phenomenon in deep networks: the emergence of low-rank structures. These implicit low-dimensional patterns provide valuable insights for improving the efficiency of training and fine-tuning large-scale models. We present a comprehensive review of advances in exploiting low-rank structures for deep learning and shed light on their mathematical foundations.
arXiv Detail & Related papers (2025-03-25T17:26:09Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - A Dynamics Theory of Implicit Regularization in Deep Low-Rank Matrix Factorization [21.64166573203593]
Implicit regularization is an important way to interpret neural networks.
Recent theory has begun to explain implicit regularization using the model of deep matrix factorization (DMF).
arXiv Detail & Related papers (2022-12-29T02:11:19Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z) - Tractable Dendritic RNNs for Reconstructing Nonlinear Dynamical Systems [7.045072177165241]
We augment a piecewise-linear recurrent neural network (RNN) by a linear spline basis expansion.
We show that this approach retains all the theoretically appealing properties of the simple PLRNN, yet boosts its capacity for approximating arbitrary nonlinear dynamical systems in comparatively low dimensions.
arXiv Detail & Related papers (2022-07-06T09:43:03Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)