Models of Heavy-Tailed Mechanistic Universality
- URL: http://arxiv.org/abs/2506.03470v1
- Date: Wed, 04 Jun 2025 00:55:01 GMT
- Title: Models of Heavy-Tailed Mechanistic Universality
- Authors: Liam Hodgkinson, Zhichao Wang, Michael W. Mahoney
- Abstract summary: We propose a family of random matrix models to explore attributes that give rise to heavy-tailed behavior in trained neural networks. Under this model, spectral densities with power-law tails arise through a combination of three independent factors. Implications of our model for other appearances of heavy tails, including neural scaling laws, optimizer trajectories, and the five-plus-one phases of neural network training, are discussed.
- Score: 62.107333654304014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent theoretical and empirical successes in deep learning, including the celebrated neural scaling laws, are punctuated by the observation that many objects of interest tend to exhibit some form of heavy-tailed or power law behavior. In particular, the prevalence of heavy-tailed spectral densities in Jacobians, Hessians, and weight matrices has led to the introduction of the concept of heavy-tailed mechanistic universality (HT-MU). Multiple lines of empirical evidence suggest a robust correlation between heavy-tailed metrics and model performance, indicating that HT-MU may be a fundamental aspect of deep learning efficacy. Here, we propose a general family of random matrix models -- the high-temperature Marchenko-Pastur (HTMP) ensemble -- to explore attributes that give rise to heavy-tailed behavior in trained neural networks. Under this model, spectral densities with power laws on (upper and lower) tails arise through a combination of three independent factors (complex correlation structures in the data; reduced temperatures during training; and reduced eigenvector entropy), appearing as an implicit bias in the model structure, and they can be controlled with an "eigenvalue repulsion" parameter. Implications of our model for other appearances of heavy tails, including neural scaling laws, optimizer trajectories, and the five-plus-one phases of neural network training, are discussed.
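Heavy-tailed metrics of the kind the abstract refers to are typically measured by fitting a power-law exponent to the upper tail of a weight matrix's empirical spectral density (ESD). The sketch below illustrates that measurement only; the synthetic Student-t "weight matrix" and the Hill estimator are illustrative assumptions, not the paper's HTMP construction.

```python
# Minimal sketch: estimate the power-law tail exponent of a matrix's
# empirical spectral density. All parameters here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained weight matrix: heavy-tailed i.i.d. entries
# (Student-t, df=3) make the ESD's upper tail visibly power-law-like.
n, m = 1000, 2000
W = rng.standard_t(df=3.0, size=(n, m))

# Eigenvalues of the correlation-style matrix W W^T / m, descending.
eigs = np.sort(np.linalg.eigvalsh(W @ W.T / m))[::-1]

# Hill estimator over the k largest eigenvalues: estimates alpha in
# P(lambda > x) ~ x^{-alpha} for the upper tail.
k = 100
alpha_hat = k / np.sum(np.log(eigs[:k] / eigs[k]))
print(f"Hill tail-exponent estimate: alpha ~ {alpha_hat:.2f}")
```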
Related papers
- Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms [6.375329734462518]
This paper proposes the "Cognitive Activation" theory, revealing the essence of Large Language Models' reasoning mechanisms. Experiments show that the model's information accumulation follows a nonlinear exponential law, and that the multilayer perceptron (MLP) accounts for a higher proportion of the final output. This research provides a chaos-theory framework for the interpretability of LLMs' reasoning and reveals potential pathways for balancing creativity and reliability in model design.
arXiv Detail & Related papers (2025-03-15T08:15:10Z)
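The quasi-Lyapunov analysis referenced above concerns LLM hidden-state dynamics; the following is only a generic sketch of the underlying measurement, a Benettin-style Lyapunov-exponent estimate for an iterated nonlinear map standing in for layer-to-layer dynamics. The random tanh map and all parameters are assumptions for illustration.

```python
# Benettin-style Lyapunov estimate: propagate a tiny perturbation
# through iterated nonlinear maps, log its growth, renormalize.
import numpy as np

rng = np.random.default_rng(1)
d, depth, gain, eps = 64, 50, 1.5, 1e-8

# Random tanh "layers" standing in for hidden-state dynamics.
Ws = [gain * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

h = rng.standard_normal(d)
delta = eps * rng.standard_normal(d)

log_growth = 0.0
for W in Ws:
    h_new = np.tanh(W @ h)
    diff = np.tanh(W @ (h + delta)) - h_new
    log_growth += np.log(np.linalg.norm(diff) / eps)
    delta = eps * diff / np.linalg.norm(diff)  # renormalize to size eps
    h = h_new

print("quasi-Lyapunov exponent estimate:", log_growth / depth)
```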
- Explosive neural networks via higher-order interactions in curved statistical manifolds [43.496401697112695]
We introduce curved neural networks as a class of prototypical models with a limited number of parameters. We show that these curved neural networks implement a self-regulating process that can accelerate memory retrieval. We analytically explore their memory-retrieval capacity using the replica trick near ferromagnetic and spin-glass phase boundaries.
arXiv Detail & Related papers (2024-08-05T09:10:29Z)
- Relational Learning in Pre-Trained Models: A Theory from Hypergraph Recovery Perspective [60.64922606733441]
We introduce a mathematical model that formalizes relational learning as hypergraph recovery to study the pre-training of Foundation Models (FMs). In our framework, the world is represented as a hypergraph, with data abstracted as random samples from hyperedges. We theoretically examine the feasibility of a Pre-Trained Model (PTM) recovering this hypergraph and analyze the data efficiency in a minimax near-optimal style.
arXiv Detail & Related papers (2024-06-17T06:20:39Z)
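As a toy illustration of the "data as random samples from hyperedges" abstraction above: with noiseless samples, recovering the hypergraph reduces to a coupon-collector problem. The exact-observation sampling model below is a drastic simplifying assumption; the paper's minimax analysis is far more general.

```python
# Toy hypergraph recovery: each data point reveals one hyperedge
# exactly; the edge set is identified once every hyperedge has appeared.
import random

random.seed(0)
true_hyperedges = {frozenset({0, 1, 2}), frozenset({2, 3}),
                   frozenset({4, 5, 6, 7}), frozenset({8, 9})}
edge_list = sorted(true_hyperedges, key=sorted)

# Data: i.i.d. samples, each a noiseless draw of one hyperedge.
samples = [random.choice(edge_list) for _ in range(200)]

seen = set()
for t, s in enumerate(samples, 1):
    seen.add(s)
    if seen == true_hyperedges:
        print(f"recovered all {len(true_hyperedges)} hyperedges after {t} samples")
        break
```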
- The twin peaks of learning neural networks [3.382017614888546]
Recent works have demonstrated the existence of a double-descent phenomenon for the generalization error of neural networks. We explore a link between this phenomenon and the increase in complexity and sensitivity of the function represented by neural networks.
arXiv Detail & Related papers (2024-01-23T10:09:14Z)
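The double-descent phenomenon above can be reproduced in miniature with minimum-norm least squares on random nonlinear features: test error typically peaks near the interpolation threshold p = n and descends again beyond it. The data model and cosine feature map below are illustrative assumptions, not the paper's setup.

```python
# Minimal double-descent sketch: sweep the number of random features p
# and watch test error of the minimum-norm solution spike near p = n.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test = 100, 20, 1000

w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.5 * rng.standard_normal(n)   # noisy training labels
X_test = rng.standard_normal((n_test, d))
y_test = X_test @ w_true

def features(X, V):
    return np.cos(X @ V)   # random nonlinear feature map

for p in [10, 50, 90, 100, 110, 200, 1000]:
    V = rng.standard_normal((d, p))
    F, F_test = features(X, V), features(X_test, V)
    beta = np.linalg.pinv(F) @ y   # least-squares / min-norm interpolator
    err = np.mean((F_test @ beta - y_test) ** 2)
    print(f"p={p:5d}  test MSE={err:10.3f}")
```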
- A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime. We quantify how the test error of overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by, e.g., the combination of model architecture and parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z)
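For background, a standard McAllester-style PAC-Bayes bound of the kind the paper above refines for the interpolating regime; the paper's specialized bound is not reproduced here.

```latex
% Standard McAllester-style PAC-Bayes bound (background only).
% Setting: prior P, any posterior Q over parameters, loss in [0, 1],
% n i.i.d. samples; then with probability at least 1 - \delta:
\[
  \mathbb{E}_{\theta \sim Q}\!\left[ L(\theta) \right]
  \;\le\;
  \mathbb{E}_{\theta \sim Q}\!\left[ \widehat{L}_n(\theta) \right]
  + \sqrt{ \frac{ \mathrm{KL}(Q \,\|\, P) + \ln\!\left( 2\sqrt{n}/\delta \right) }{ 2n } } .
\]
```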
- Equivariant vector field network for many-body system modeling [65.22203086172019]
The Equivariant Vector Field Network (EVFN) is built on a novel equivariant basis and the associated scalarization and vectorization layers. We evaluate our method on predicting trajectories of simulated Newtonian mechanics systems with both fully and partially observed data.
arXiv Detail & Related papers (2021-10-26T14:26:25Z)
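The defining property behind equivariant architectures like EVFN is that the predicted vector field commutes with orthogonal transforms of the input. The sketch below verifies this for a toy pairwise-difference field; the construction and all names are illustrative, not the EVFN architecture.

```python
# Equivariance check: for a field built from pairwise differences and
# rotation-invariant scalars, f(x R) == f(x) R for orthogonal R.
import numpy as np

rng = np.random.default_rng(0)

def vector_field(x):
    """Toy equivariant field: each particle feels distance-weighted
    pairwise difference vectors. x has shape (n_particles, 3)."""
    diffs = x[:, None, :] - x[None, :, :]              # (n, n, 3)
    dists = np.linalg.norm(diffs, axis=-1, keepdims=True)
    weights = np.exp(-dists**2)                        # invariant scalars
    return (weights * diffs).sum(axis=1)               # (n, 3)

# Random orthogonal transform via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))

x = rng.standard_normal((5, 3))
lhs = vector_field(x @ Q)    # transform inputs, then apply field
rhs = vector_field(x) @ Q    # apply field, then transform outputs
print("equivariant:", np.allclose(lhs, rhs))
```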
- Tensor networks for unsupervised machine learning [9.897828174118974]
We present Autoregressive Matrix Product States (AMPS), a tensor-network-based model combining matrix product states from quantum many-body physics with autoregressive models from machine learning. We show that the proposed model significantly outperforms existing tensor-network-based models and restricted Boltzmann machines.
arXiv Detail & Related papers (2021-06-24T12:51:00Z)
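A loose sketch of the idea behind tensor-network autoregressive models: conditionals are produced by contracting a per-symbol matrix into a running hidden vector. This simplified stand-in is an assumption for illustration only; the actual AMPS construction parameterizes normalized conditionals through matrix product states.

```python
# Loose autoregressive sketch with matrix-product structure: one matrix
# per symbol, conditionals read off the running contraction. Not AMPS.
import numpy as np

rng = np.random.default_rng(0)
D, T = 4, 8                                  # bond dimension, length
A = rng.standard_normal((2, D, D)) * 0.5     # one matrix per symbol
v0 = rng.standard_normal(D)
w = rng.standard_normal(D)

def log_prob(s):
    """log p(s) = sum_t log p(s_t | s_<t), conditionals via contraction."""
    h, lp = v0, 0.0
    for s_t in s:
        logits = np.array([w @ (A[b] @ h) for b in (0, 1)])
        logits -= logits.max()               # stable softmax
        probs = np.exp(logits) / np.exp(logits).sum()
        lp += np.log(probs[s_t])
        h = A[s_t] @ h
        h = h / np.linalg.norm(h)            # keep the contraction stable
    return lp

s = rng.integers(0, 2, size=T)
print("log p(s) =", log_prob(s))
```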
- On Energy-Based Models with Overparametrized Shallow Neural Networks [44.74000986284978]
Energy-based models (EBMs) are a powerful framework for generative modeling. In this work we focus on shallow neural networks. We show that models trained in the so-called "active" regime provide a statistical advantage over their associated "lazy" or kernel regime.
arXiv Detail & Related papers (2021-04-15T15:34:58Z)
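A minimal sketch of the object studied above: an energy-based model whose energy is a shallow (one-hidden-layer) network. Only the unnormalized density is shown; training and the active/lazy analysis are omitted, and all scalings and names below are illustrative assumptions.

```python
# EBM with a shallow-network energy: density p(x) proportional to
# exp(-E(x)), where E is a one-hidden-layer network. Sketch only.
import numpy as np

rng = np.random.default_rng(0)
d, m = 2, 128                       # input dim, hidden width

W = rng.standard_normal((m, d))
a = rng.standard_normal(m) / m      # mean-field-style 1/m output scaling

def energy(x):
    """E(x): weighted sum of ReLU features; lower energy = higher density."""
    return a @ np.maximum(W @ x, 0.0)

def unnormalized_density(x):
    return np.exp(-energy(x))

x = rng.standard_normal(d)
print("E(x) =", energy(x), " exp(-E(x)) =", unnormalized_density(x))
```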
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of its stochasticity in that success is still unclear. We show that heavy tails commonly arise in discrete-time stochastic optimization due to multiplicative noise in the parameter updates. A detailed analysis is conducted describing key factors, including step size and data, and similar results are exhibited on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
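The mechanism above, multiplicative noise producing heavy tails, is captured by the classical Kesten recursion x_{t+1} = a_t x_t + b_t: when E[log a_t] < 0 but a_t occasionally exceeds 1, the stationary distribution has a power-law tail. The sketch below simulates this and estimates the tail exponent; it illustrates the mechanism only, not the paper's optimizer analysis.

```python
# Kesten recursion: multiplicative noise => heavy-tailed stationary law.
# Kesten theory predicts the exponent alpha solving E[a^alpha] = 1;
# for lognormal(mu=-0.1, sigma=0.5) this gives alpha = -2*mu/sigma^2 = 0.8.
import numpy as np

rng = np.random.default_rng(0)
T, burn = 200_000, 1_000

a = rng.lognormal(mean=-0.1, sigma=0.5, size=T)   # E[log a] = -0.1 < 0
b = rng.standard_normal(T)

x = np.empty(T)
x[0] = 0.0
for t in range(T - 1):
    x[t + 1] = a[t] * x[t] + b[t]

# Hill estimate on the 2000 largest |x| after burn-in.
tail = np.sort(np.abs(x[burn:]))[::-1][:2000]
alpha_hat = len(tail) / np.sum(np.log(tail / tail[-1]))
print(f"estimated tail exponent: {alpha_hat:.2f}")
```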
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
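The kernel-to-rich transition above is commonly probed with output scaling f(x) = alpha * g(x; w): large alpha keeps weights near initialization (kernel/lazy regime), while small alpha forces them to move (rich regime). The tiny two-layer network, task, and learning-rate scaling below are all illustrative assumptions.

```python
# Sketch: first-layer weight movement after training shrinks as the
# output scale alpha grows (lazy regime) and grows as alpha shrinks.
import numpy as np

n, d, m = 50, 5, 200
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                     # simple binary-valued target

def train(alpha, steps=4000):
    rg = np.random.default_rng(1)        # same init for every alpha
    lr = 0.0005 / alpha**2               # keep output-space dynamics comparable
    W = rg.standard_normal((m, d)) / np.sqrt(d)
    a = rg.standard_normal(m) / np.sqrt(m)
    W0 = W.copy()
    for _ in range(steps):
        H = np.tanh(X @ W.T)             # (n, m) hidden activations
        pred = alpha * (H @ a)
        g = 2 * (pred - y) / n           # d(MSE)/d(pred)
        grad_a = alpha * (H.T @ g)
        grad_W = ((1 - H**2) * (alpha * np.outer(g, a))).T @ X
        a -= lr * grad_a
        W -= lr * grad_W
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

for alpha in (0.5, 1.0, 4.0):
    print(f"alpha={alpha:4.1f}  relative first-layer movement: {train(alpha):.4f}")
```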
- Learning the Ising Model with Generative Neural Networks [0.0]
We study the representational characteristics of restricted Boltzmann machines (RBMs) and variational autoencoders (VAEs). Our results suggest that the considered RBMs and convolutional VAEs are able to capture the temperature dependence of magnetization, energy, and spin-spin correlations. We also find that convolutional layers in VAEs are important for modeling spin correlations, whereas RBMs achieve similar or even better performance without convolutional filters.
arXiv Detail & Related papers (2020-01-15T15:04:21Z)
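Training data for experiments like those above are typically generated by Monte Carlo sampling of the Ising model. Below is a minimal Metropolis sampler for the 2D Ising model (J = 1, periodic boundaries); lattice size, temperature, and sweep counts are illustrative choices, not the paper's settings.

```python
# Metropolis sampling of the 2D Ising model: the kind of data an RBM or
# VAE would be trained on to learn magnetization and spin correlations.
import numpy as np

rng = np.random.default_rng(0)
L, T, sweeps = 16, 2.27, 500          # T near the critical temperature

spins = rng.choice([-1, 1], size=(L, L))

def sweep(spins, T):
    for _ in range(L * L):
        i, j = rng.integers(L), rng.integers(L)
        nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j] +
              spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2 * spins[i, j] * nb     # energy change if this spin flips
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            spins[i, j] *= -1

for _ in range(sweeps):
    sweep(spins, T)

print("magnetization per spin:", spins.mean())
```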
This list is automatically generated from the titles and abstracts of the papers on this site.