Provable Generalization in Overparameterized Neural Nets
- URL: http://arxiv.org/abs/2508.17256v1
- Date: Sun, 24 Aug 2025 08:46:31 GMT
- Title: Provable Generalization in Overparameterized Neural Nets
- Authors: Aviral Dhingra
- Abstract summary: Deep neural networks often contain far more parameters than training examples, yet they still generalize well in practice. I explore an alternative notion of capacity for attention-based models, based on the effective rank of their attention matrices. I show that this quantity leads to a generalization bound whose dependence on sample size matches empirical scaling laws observed in large language models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks often contain far more parameters than training examples, yet they still manage to generalize well in practice. Classical complexity measures such as VC-dimension or PAC-Bayes bounds usually become vacuous in this overparameterized regime, offering little explanation for the empirical success of models like Transformers. In this work, I explore an alternative notion of capacity for attention-based models, based on the effective rank of their attention matrices. The intuition is that, although the parameter count is enormous, the functional dimensionality of attention is often much lower. I show that this quantity leads to a generalization bound whose dependence on sample size matches empirical scaling laws observed in large language models, up to logarithmic factors. While the analysis is not a complete theory of overparameterized learning, it provides evidence that spectral properties of attention, rather than raw parameter counts, may be the right lens for understanding why these models generalize.
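To make the capacity notion concrete: a standard formalization of effective rank is the entropy-based definition of Roy & Vetterli (2007), computed from the singular-value spectrum; the paper's exact definition may differ. A minimal numpy sketch applied to a toy attention matrix:

```python
import numpy as np

def effective_rank(A: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank (Roy & Vetterli, 2007):
    exp of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / (s.sum() + eps)          # normalize the spectrum
    p = p[p > eps]                   # drop numerically-zero mass
    return float(np.exp(-(p * np.log(p)).sum()))

# Toy attention matrix: row-wise softmax over random scores (8 queries, 8 keys)
rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 8))
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(effective_rank(attn))   # usually below the full rank of 8
```

The paper's premise is that trained attention matrices have spectra concentrated enough that this quantity stays far below the raw parameter count.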
Related papers
- The Universality Lens: Why Even Highly Over-Parametrized Models Learn Well [4.2466572124752995]
We study a Bayesian mixture with log-loss and an (almost) uniform prior over an expansive hypothesis class. The key result shows that the learner's regret is not determined by the overall size of the hypothesis class. The results apply broadly across online, batch, and supervised learning settings.
arXiv Detail & Related papers (2025-06-09T11:32:31Z)
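For background, a classical fact about Bayesian mixtures under log-loss (not a result specific to this paper): if q is the mixture q(x^T) = \sum_h \pi(h) p_h(x^T), then for every hypothesis h,

```latex
\sum_{t=1}^{T} \log \frac{1}{q(x_t \mid x^{t-1})}
  - \sum_{t=1}^{T} \log \frac{1}{p_h(x_t \mid x^{t-1})}
  \;\le\; \log \frac{1}{\pi(h)} .
```

The regret is therefore controlled by the prior mass placed near a good hypothesis rather than by the raw size of the class, which is one way to read the blurb's key result.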
- Compute-Optimal LLMs Provably Generalize Better With Scale [102.29926217670926]
We develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. We produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
arXiv Detail & Related papers (2025-04-21T16:26:56Z)
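For reference, the classical (non-empirical) Freedman inequality this line of work refines: for a martingale difference sequence X_1, ..., X_T with X_t <= R almost surely,

```latex
\Pr\!\left[ \sum_{t=1}^{T} X_t \ge a
  \;\text{ and }\;
  \sum_{t=1}^{T} \mathbb{E}\!\left[ X_t^2 \mid \mathcal{F}_{t-1} \right] \le v \right]
\;\le\; \exp\!\left( - \frac{a^2}{2\,(v + R a / 3)} \right).
```

The paper's fully empirical variant replaces the unobservable conditional variance with quantities measurable from the training run; its exact form is not reproduced here.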
- The $\varphi$ Curve: The Shape of Generalization through the Lens of Norm-based Capacity Control [23.293525050286224]
We consider norm-based capacity measures and develop our study for random-features-based estimators. We provide a precise characterization of how the estimator's norm concentrates and how it governs the associated test error. This confirms that the more classical U-shaped behavior is recovered when appropriate capacity measures, based on model norms rather than model size, are considered.
arXiv Detail & Related papers (2025-02-03T18:10:40Z)
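As a toy illustration of a norm-based capacity measure for random-features estimators (the setup below is a hypothetical stand-in, not the paper's construction): track the coefficient norm of a ridge fit as the number of random features grows past the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10                          # samples, input dimension
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def rf_coefficient_norm(p: int, lam: float = 1e-6) -> float:
    """Fit ridge regression on p random ReLU features and return
    the coefficient norm, a norm-based capacity proxy."""
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi = np.maximum(X @ W, 0.0)                    # random features
    theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
    return float(np.linalg.norm(theta))

for p in [50, 200, 800, 3200]:          # under- to over-parameterized
    print(p, rf_coefficient_norm(p))
```

Plotted against p, the norm typically rises toward the interpolation threshold and falls beyond it, which is how a norm-based measure can recover U-shaped behavior that the raw feature count hides.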
- Scaling and renormalization in high-dimensional regression [72.59731158970894]
We present a unifying perspective on recent results on ridge regression. We use the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning. Our results extend and provide a unifying perspective on earlier models of scaling laws.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
- PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization [48.26492774959634]
We develop a compression approach based on quantizing neural network parameters in a linear subspace.
We find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor.
arXiv Detail & Related papers (2022-11-24T13:50:16Z)
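A caricature of the compression idea, assuming a random linear subspace and uniform quantization (the paper's method trains within the subspace and optimizes the quantization, so this only sketches the shape of the construction):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=10_000)        # flattened network parameters
k = 256                                # subspace dimension << parameter count

# Express the parameters by k coordinates in a random linear subspace
P = rng.normal(size=(theta.size, k)) / np.sqrt(theta.size)
z = P.T @ theta

# Quantize each coordinate to a 16-level uniform codebook (4 bits each)
levels = np.linspace(z.min(), z.max(), num=16)
z_q = levels[np.abs(z[:, None] - levels[None, :]).argmin(axis=1)]

theta_hat = P @ z_q                    # decompressed parameters
print(f"compressed: {k * 4} bits vs {theta.size * 32} bits for float32")
```

The PAC-Bayes bound then pays for the bits of the code rather than the raw parameter count, which is the Occam's-razor reading of the result.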
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
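The local Lipschitz regularity such a bound depends on can at least be probed numerically. Below is a hypothetical Monte-Carlo lower estimate on a toy network; the paper's analysis proceeds via optimal transport, not this procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network standing in for the learned predictor
W1, W2 = rng.normal(size=(16, 4)), rng.normal(size=(1, 16))
f = lambda x: (W2 @ np.tanh(W1 @ x)).item()

def local_lipschitz(x: np.ndarray, radius: float = 0.1, trials: int = 500) -> float:
    """Monte-Carlo lower bound on the Lipschitz constant of f
    within a ball of the given radius around x."""
    best = 0.0
    for _ in range(trials):
        u = rng.normal(size=x.shape)
        u *= radius * rng.uniform() / np.linalg.norm(u)   # random small step
        best = max(best, abs(f(x + u) - f(x)) / np.linalg.norm(u))
    return best

x0 = rng.normal(size=4)
print(local_lipschitz(x0))   # smaller values indicate a locally flatter f
```

Instance-dependence enters because this quantity varies from point to point in the data space, unlike a single global Lipschitz constant.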
- Bandwidth Enables Generalization in Quantum Kernel Models [16.940180366663903]
Recent results show that generalization of quantum models is hindered by the exponential size of the quantum feature space.
We show that changing the value of the bandwidth can take a model from being provably unable to generalize to any target function to generalizing well on well-aligned targets.
arXiv Detail & Related papers (2022-06-14T08:41:08Z)
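A classical stand-in can illustrate the mechanism. The bandwidth of a kernel controls how its spectrum concentrates on the data, which governs learnability; below, an RBF kernel's participation ratio (my choice of effective-dimension proxy, not the paper's quantity) is swept over bandwidths:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))

def kernel_effective_dim(c: float) -> float:
    """Participation ratio of an RBF kernel K = exp(-c * ||x - x'||^2),
    a classical stand-in for a bandwidth-scaled quantum kernel."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    eig = np.linalg.eigvalsh(np.exp(-c * sq))
    return float(eig.sum() ** 2 / (eig ** 2).sum())

for c in [0.001, 0.01, 0.1, 1.0, 10.0]:
    print(c, kernel_effective_dim(c))
```

Tiny bandwidths collapse the kernel toward rank one and huge ones push it toward the identity; generalization is possible only in between, mirroring the bandwidth story told for quantum kernels.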
- More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize [94.70343385404203]
We find that most theoretical analyses fall short of capturing qualitative phenomena even for kernel regression.
We prove that the classical GCV estimator converges to the generalization risk whenever a local random matrix law holds.
Our findings suggest that random matrix theory may be central to understanding the properties of neural representations in practice.
arXiv Detail & Related papers (2022-03-11T18:59:01Z)
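The GCV estimator referenced here is the classical one: for ridge regression with hat matrix S_lambda = X (X^T X + lambda I)^{-1} X^T, GCV(lambda) = (||(I - S_lambda) y||^2 / n) / (tr(I - S_lambda) / n)^2. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 30
X = rng.normal(size=(n, d))
beta = rng.normal(size=d) / np.sqrt(d)
y = X @ beta + 0.5 * rng.normal(size=n)

def gcv(lam: float) -> float:
    """Generalized cross-validation score for ridge regression."""
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)   # hat matrix
    resid = y - S @ y
    return float((resid @ resid / n) / (1.0 - np.trace(S) / n) ** 2)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(lam, gcv(lam))   # the minimizer tracks the out-of-sample risk
```

The paper's point is that this classical estimator remains consistent for the generalization risk whenever a local random matrix law holds.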
- Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox, where "scale" metrics perform well overall but poorly on subpartitions of the data.
We present two novel shape metrics, one data-independent and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z)
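To see the scale-versus-shape distinction on a single weight matrix: below, the Frobenius norm serves as a scale metric and the stable rank as a (data-independent) shape proxy. These are illustrative stand-ins; the paper's shape metrics are its own constructions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 256))          # a weight matrix from some layer

s = np.linalg.svd(W, compute_uv=False)   # singular values

scale_metric = float(np.sqrt((s ** 2).sum()))      # Frobenius norm ("scale")
shape_metric = float((s ** 2).sum() / s[0] ** 2)   # stable rank ("shape")

print(scale_metric, shape_metric)
```

Scale metrics grow with the overall magnitude of the weights, while shape metrics respond to how the spectrum is distributed, which is why the two can disagree across subpartitions of a model corpus.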
- Benign overfitting in ridge regression [0.0]
We provide non-asymptotic generalization bounds for overparametrized ridge regression.
We identify when small or negative regularization is sufficient for obtaining small generalization error.
arXiv Detail & Related papers (2020-09-29T20:00:31Z)
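A minimal numpy experiment in the spirit of that result: overparameterized ridge regression in dual (kernel) form, with the regularization swept down to zero and slightly negative. The setup is hypothetical, and lambda must stay above -lambda_min(X X^T) for the linear system to remain well-posed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                         # overparameterized: d >> n
X = rng.normal(size=(n, d))
beta = rng.normal(size=d) / np.sqrt(d)
y = X @ beta + 0.1 * rng.normal(size=n)

X_test = rng.normal(size=(1000, d))
y_test = X_test @ beta

def ridge_test_mse(lam: float) -> float:
    """Test MSE of ridge regression; lam may be zero or slightly negative."""
    K = X @ X.T                        # dual form keeps the system n x n
    alpha = np.linalg.solve(K + lam * np.eye(n), y)
    pred = X_test @ (X.T @ alpha)
    return float(((pred - y_test) ** 2).mean())

for lam in [10.0, 1.0, 0.1, 0.0, -0.1]:
    print(lam, ridge_test_mse(lam))
```

With d >> n the eigenvalues of X X^T concentrate well above zero, so even lam = -0.1 is safe here; the point of the paper is that such small or negative regularization can already yield small generalization error.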
This list is automatically generated from the titles and abstracts of the papers on this site.