Non-Vacuous Generalization Bounds for Large Language Models
- URL: http://arxiv.org/abs/2312.17173v3
- Date: Wed, 17 Jul 2024 15:32:47 GMT
- Title: Non-Vacuous Generalization Bounds for Large Language Models
- Authors: Sanae Lotfi, Marc Finzi, Yilun Kuang, Tim G. J. Rudner, Micah Goldblum, Andrew Gordon Wilson,
- Abstract summary: We provide the first non-vacuous generalization bounds for pretrained large language models.
We show that larger models have better generalization bounds and are more compressible than smaller models.
- Score: 78.42762571499061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply parrot their training corpora. We provide the first non-vacuous generalization bounds for pretrained large language models (LLMs), indicating that language models are capable of discovering regularities that generalize to unseen data. In particular, we derive a compression bound that is valid for the unbounded log-likelihood loss using prediction smoothing, and we extend the bound to handle subsampling, accelerating bound computation by orders of magnitude on massive datasets. To achieve the extreme level of compression required for non-vacuous bounds, we devise SubLoRA, a simple low-dimensional nonlinear parameterization that leads to non-vacuous generalization bounds for models with nearly a billion parameters. Finally, we use our bounds to understand LLM generalization and find that larger models have better generalization bounds and are more compressible than smaller models.
Related papers
- Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models [79.70436109672599]
We derive non-vacuous generalization bounds for large language models as large as LLaMA2-70B.
Our work achieves the first non-vacuous bounds for models that are deployed in practice and generate high-quality text.
arXiv Detail & Related papers (2024-07-25T16:13:58Z) - Slicing Mutual Information Generalization Bounds for Neural Networks [14.48773730230054]
We introduce new, tighter information-theoretic generalization bounds tailored for deep learning algorithms.
Our bounds offer significant computational and statistical advantages over standard MI bounds.
We extend our analysis to algorithms whose parameters do not need to exactly lie on random subspaces.
arXiv Detail & Related papers (2024-06-06T13:15:37Z) - Compressing Sentence Representation with maximum Coding Rate Reduction [0.0]
In most natural language inference problems, sentence representation is needed for semantic retrieval tasks.
Due to space and time hardware limitations, there is a need to attain comparable results when using the smaller model.
We demonstrate that the new language model with reduced complexity and sentence embedding size can achieve comparable results on semantic retrieval benchmarks.
arXiv Detail & Related papers (2023-04-25T09:23:43Z) - PAC-Bayes Compression Bounds So Tight That They Can Explain
Generalization [48.26492774959634]
We develop a compression approach based on quantizing neural network parameters in a linear subspace.
We find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor.
arXiv Detail & Related papers (2022-11-24T13:50:16Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Low-Rank Constraints for Fast Inference in Structured Models [110.38427965904266]
This work demonstrates a simple approach to reduce the computational and memory complexity of a large class of structured models.
Experiments with neural parameterized structured models for language modeling, polyphonic music modeling, unsupervised grammar induction, and video modeling show that our approach matches the accuracy of standard models at large state spaces.
arXiv Detail & Related papers (2022-01-08T00:47:50Z) - Information-Theoretic Generalization Bounds for Iterative
Semi-Supervised Learning [81.1071978288003]
In particular, we seek to understand the behaviour of the em generalization error of iterative SSL algorithms using information-theoretic principles.
Our theoretical results suggest that when the class conditional variances are not too large, the upper bound on the generalization error decreases monotonically with the number of iterations, but quickly saturates.
arXiv Detail & Related papers (2021-10-03T05:38:49Z) - The Predictive Normalized Maximum Likelihood for Over-parameterized
Linear Regression with Norm Constraint: Regret and Double Descent [12.929639356256928]
We show that modern machine learning models do not obey a trade-off between the complexity of a prediction rule and its ability to generalize.
We use the recently proposed predictive normalized maximum likelihood (pNML) which is the min-max regret solution for individual data.
We demonstrate the use of the pNML regret as a point-wise learnability measure on synthetic data and that it can successfully predict the double-decent phenomenon.
arXiv Detail & Related papers (2021-02-14T15:49:04Z) - Intrinsic Dimensionality Explains the Effectiveness of Language Model
Fine-Tuning [52.624194343095304]
We argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions.
We empirically show that common pre-trained models have a very low intrinsic dimension.
arXiv Detail & Related papers (2020-12-22T07:42:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.