Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra
- URL: http://arxiv.org/abs/2410.09005v1
- Date: Fri, 11 Oct 2024 17:21:42 GMT
- Title: Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra
- Authors: Roman Worschech, Bernd Rosenow
- Abstract summary: Neural scaling laws describe how the performance of deep neural networks scales with key factors such as training data size, model complexity, and training time.
We employ techniques from statistical mechanics to analyze one-pass stochastic gradient descent within a student-teacher framework.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural scaling laws describe how the performance of deep neural networks scales with key factors such as training data size, model complexity, and training time, often following power-law behaviors over multiple orders of magnitude. Despite their empirical observation, the theoretical understanding of these scaling laws remains limited. In this work, we employ techniques from statistical mechanics to analyze one-pass stochastic gradient descent within a student-teacher framework, where both the student and teacher are two-layer neural networks. Our study primarily focuses on the generalization error and its behavior in response to data covariance matrices that exhibit power-law spectra. For linear activation functions, we derive analytical expressions for the generalization error, exploring different learning regimes and identifying conditions under which power-law scaling emerges. Additionally, we extend our analysis to non-linear activation functions in the feature learning regime, investigating how power-law spectra in the data covariance matrix impact learning dynamics. Importantly, we find that the length of the symmetric plateau depends on the number of distinct eigenvalues of the data covariance matrix and the number of hidden units, demonstrating how these plateaus behave under various configurations. In addition, our results reveal a transition from exponential to power-law convergence in the specialized phase when the data covariance matrix possesses a power-law spectrum. This work contributes to the theoretical understanding of neural scaling laws and provides insights into optimizing learning performance in practical scenarios involving complex data structures.
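To make the setup concrete, below is a minimal, illustrative sketch of the kind of experiment the abstract describes: one-pass (online) SGD for a two-layer student learning a two-layer teacher, with inputs drawn from a Gaussian whose covariance has a power-law eigenvalue spectrum. This is not the authors' derivation or code; the tanh activation (in place of a generic sigmoidal function), the soft-committee parameterization with fixed second-layer weights, and all dimensions and hyperparameters are assumptions chosen for readability.
```python
# Minimal sketch (assumptions, not the paper's exact setup): online one-pass
# SGD for a two-layer "soft committee" student learning from a two-layer
# teacher, with Gaussian inputs whose covariance has a power-law spectrum.
import numpy as np

rng = np.random.default_rng(0)

d = 200                      # input dimension (assumed)
m_teacher, m_student = 2, 4  # hidden units (assumed)
alpha = 1.5                  # power-law exponent of the data spectrum (assumed)
lr, n_steps = 0.05, 200_000  # learning rate and number of one-pass steps

# Power-law covariance spectrum: lambda_k ~ k^(-alpha). Working in the
# eigenbasis, sampling x ~ N(0, Sigma) reduces to scaling white noise.
eigvals = np.arange(1, d + 1, dtype=float) ** (-alpha)
sqrt_cov = np.sqrt(eigvals)

g = np.tanh                                # sigmoidal activation (assumed)
g_prime = lambda z: 1.0 - np.tanh(z) ** 2  # derivative of tanh

# First-layer weights; second-layer weights fixed to 1/m (soft committee).
W_teacher = rng.standard_normal((m_teacher, d)) / np.sqrt(d)
W_student = rng.standard_normal((m_student, d)) / np.sqrt(d)

def output(W, x):
    """Two-layer network with averaged hidden units."""
    return g(W @ x).mean()

def gen_error(n_test=5_000):
    """Monte Carlo estimate of the generalization error eps_g."""
    X = rng.standard_normal((n_test, d)) * sqrt_cov
    diffs = [output(W_student, x) - output(W_teacher, x) for x in X]
    return 0.5 * np.mean(np.square(diffs))

for t in range(n_steps):
    x = rng.standard_normal(d) * sqrt_cov  # one fresh sample per step
    delta = output(W_student, x) - output(W_teacher, x)
    pre = W_student @ x
    # Gradient of 0.5 * delta**2 with respect to each student hidden unit.
    W_student -= lr * delta * (g_prime(pre)[:, None] * x[None, :]) / m_student
    if t % 20_000 == 0:
        print(f"step {t:7d}   eps_g ~ {gen_error():.5f}")
```
Sweeping alpha, the number of distinct eigenvalues, and m_student in a script like this is one way to probe the plateau lengths and the exponential-versus-power-law convergence behavior discussed in the abstract.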
Related papers
- Cross-Entropy Is All You Need To Invert the Data Generating Process [29.94396019742267]
Empirical phenomena suggest that supervised models can learn interpretable factors of variation in a linear fashion.
Recent advances in self-supervised learning have shown that these methods can recover latent structures by inverting the data generating process.
We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation.
arXiv Detail & Related papers (2024-10-29T09:03:57Z)
- A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities [30.737171081270322]
We study how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step.
This provides a sharp description of the impact of feature learning in the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
arXiv Detail & Related papers (2024-10-24T17:24:34Z)
- Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process.
We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures.
We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
arXiv Detail & Related papers (2023-12-04T01:32:42Z)
- A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near-internet-scale number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
- Spectral Regularization Allows Data-frugal Learning over Combinatorial Spaces [13.36217184117654]
We show that regularizing the spectral representation of machine learning models improves their generalization power when labeled data is scarce.
Running gradient descent on the regularized loss yields better generalization performance than baseline algorithms in several data-scarce real-world problems.
arXiv Detail & Related papers (2022-10-05T23:31:54Z)
- Curvature-informed multi-task learning for graph networks [56.155331323304]
State-of-the-art graph neural networks attempt to predict multiple properties simultaneously.
We investigate a potential explanation for this phenomenon: the curvature of each property's loss surface significantly varies, leading to inefficient learning.
arXiv Detail & Related papers (2022-08-02T18:18:41Z)
- Universal scaling laws in the gradient descent training of neural networks [10.508187462682308]
We show that the learning trajectory can be characterized by explicit bounds at large training times.
Our results are based on spectral analysis of the evolution of a large network trained on the expected loss.
arXiv Detail & Related papers (2021-05-02T16:46:38Z)
- Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory [110.99247009159726]
Temporal-difference and Q-learning play a key role in deep reinforcement learning, where they are empowered by expressive nonlinear function approximators such as neural networks.
In particular, temporal-difference learning converges when the function approximator is linear in a feature representation, which is fixed throughout learning, and possibly diverges otherwise.
arXiv Detail & Related papers (2020-06-08T17:25:22Z)
- Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, a clear understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate the transition from the kernel to the rich regime empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.