Why Deep Learning Generalizes
- URL: http://arxiv.org/abs/2211.09639v1
- Date: Thu, 17 Nov 2022 16:39:43 GMT
- Title: Why Deep Learning Generalizes
- Authors: Benjamin L. Badger
- Abstract summary: We find that memorization is difficult relative to generalization, but that adding noise makes memorization easier.
We show that generalization results from a model's parameters being attracted to points of maximal stability with respect to that model's inputs during gradient descent.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Very large deep learning models trained using gradient descent are remarkably
resistant to memorization given their huge capacity, but are at the same time
capable of fitting large datasets of pure noise. Here, methods are introduced
by which models may be trained to memorize datasets that are normally
generalized.
We find that memorization is difficult relative to generalization, but that
adding noise makes memorization easier. Increasing the dataset size exaggerates
the characteristics of that dataset: model access to more training samples
makes overfitting easier for random data, but somewhat harder for natural
images. The bias of deep learning towards generalization is explored
theoretically, and we show that generalization results from a model's
parameters being attracted to points of maximal stability with respect to that
model's inputs during gradient descent.
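The abstract's two core claims, that pure-noise labels are harder to fit than structured ones and that generalization tracks stability with respect to inputs, can be illustrated with a toy experiment. Below is a minimal sketch assuming PyTorch and a synthetic binary task; the network, the data, and the input-gradient norm used as a stability proxy are illustrative choices, not the paper's actual setup.

```python
# Illustrative sketch (not the paper's code): fit the same small network to a
# learnable rule vs. pure-noise labels, and track the gradient norm of the loss
# w.r.t. the inputs as a rough proxy for stability under input perturbations.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_data(n=512, d=32, random_labels=False):
    x = torch.randn(n, d)
    if random_labels:
        y = torch.randint(0, 2, (n,))       # pure-noise targets (memorization)
    else:
        y = (x[:, 0] + x[:, 1] > 0).long()  # learnable rule (generalization)
    return x, y

def input_grad_norm(model, x, y):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (g,) = torch.autograd.grad(loss, x)
    return g.norm(dim=1).mean().item()

for random_labels in (False, True):
    x, y = make_data(random_labels=random_labels)
    model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(2000):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    acc = (model(x).argmax(dim=1) == y).float().mean().item()
    print(f"random_labels={random_labels}: train_acc={acc:.2f}, "
          f"input_grad_norm={input_grad_norm(model, x, y):.3f}")
```

Under the paper's account, the random-label run should take more optimization effort to reach the same training accuracy, and adding input noise to the structured task (e.g. `x += noise_scale * torch.randn_like(x)`, a hypothetical perturbation) should make memorization easier to induce.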
Related papers
- The Unreasonable Effectiveness of Easy Training Data for Hard Tasks [84.30018805150607]
We present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data.
We demonstrate this kind of easy-to-hard generalization using simple finetuning methods like in-context learning, linear heads, and QLoRA.
We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied.
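Of the finetuning methods named above, a linear head is the simplest to sketch: fit a linear probe on frozen features from easy examples and evaluate it on hard ones. This is a hypothetical illustration with synthetic stand-ins for the frozen model features and the easy/hard split, not the paper's setup.

```python
# Hypothetical easy-to-hard evaluation with a linear head. `easy_X`/`hard_X`
# stand in for frozen LM features; here they are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
w = rng.normal(size=64)                       # shared ground-truth direction
easy_X = rng.normal(size=(500, 64))           # low-variance ("easy") examples
hard_X = rng.normal(size=(500, 64)) * 3.0     # high-variance ("hard") examples
easy_y = (easy_X @ w > 0).astype(int)
hard_y = (hard_X @ w > 0).astype(int)

head = LogisticRegression(max_iter=1000).fit(easy_X, easy_y)  # train on easy only
print("easy->hard accuracy:", head.score(hard_X, hard_y))     # evaluate on hard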
arXiv Detail & Related papers (2024-01-12T18:36:29Z)
- Non-Vacuous Generalization Bounds for Large Language Models [78.42762571499061]
We provide the first non-vacuous generalization bounds for pretrained large language models.
We show that larger models have better generalization bounds and are more compressible than smaller models.
arXiv Detail & Related papers (2023-12-28T17:58:42Z)
- Data Factors for Better Compositional Generalization [60.698130703909804]
We conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors.
We show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges.
We explore how training examples of different difficulty levels influence generalization differently.
arXiv Detail & Related papers (2023-11-08T01:27:34Z)
- What do larger image classifiers memorise? [64.01325988398838]
We show that training examples exhibit an unexpectedly diverse set of memorisation trajectories across model sizes.
We find that knowledge distillation, an effective and popular model compression technique, tends to inhibit memorisation, while also improving generalisation.
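Knowledge distillation here refers to the standard setup of training a student against a teacher's softened outputs. A generic formulation is sketched below; the temperature and mixing weight are common defaults, not values from the paper.

```python
# Generic knowledge-distillation loss (a common formulation; T and alpha are
# illustrative defaults, not the paper's values).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                               # rescale soft-target gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```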
arXiv Detail & Related papers (2023-10-09T01:52:07Z)
- Phantom Embeddings: Using Embedding Space for Model Regularization in Deep Neural Networks [12.293294756969477]
The strength of machine learning models stems from their ability to learn complex function approximations from data.
Complex models tend to memorize the training data, which results in poor generalization on test data.
We present a novel approach to regularize the models by leveraging the information-rich latent embeddings and their high intra-class correlation.
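The abstract does not give the exact loss, but a regularizer exploiting high intra-class correlation of embeddings could, as a rough guess at the general idea only (not the Phantom Embeddings formulation), pull each embedding toward its class centroid:

```python
# Illustrative intra-class embedding regularizer (a guess at the general idea;
# NOT the Phantom Embeddings method, which the abstract does not specify).
import torch

def intra_class_pull(embeddings, labels):
    """Penalize distance of each embedding from its class centroid."""
    classes = labels.unique()
    loss = embeddings.new_zeros(())
    for c in classes:
        z = embeddings[labels == c]
        loss = loss + (z - z.mean(dim=0)).pow(2).sum(dim=1).mean()
    return loss / classes.numel()
```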
arXiv Detail & Related papers (2023-04-14T17:15:54Z)
- The Curious Case of Benign Memorization [19.74244993871716]
We show that under training protocols that include data augmentation, neural networks learn to memorize entirely random labels in a benign way.
We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers.
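One common way to test which layers carry the memorization, in the spirit of this finding, is linear probing of frozen intermediate features. A hypothetical sketch follows; the untrained network is a placeholder for a model that has been fit to random labels.

```python
# Hypothetical layer-wise probe: fit a linear classifier on frozen features
# from each layer to see which layers linearly separate the random labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(256, 32)
y = torch.randint(0, 2, (256,))               # random labels
layers = nn.ModuleList([nn.Sequential(nn.Linear(32, 32), nn.ReLU())
                        for _ in range(4)])   # stand-in for a trained network

feats, h = [], x
for layer in layers:
    h = layer(h)
    feats.append(h.detach())                  # freeze features at each depth

for i, f in enumerate(feats):
    probe = nn.Linear(f.shape[1], 2)
    opt = torch.optim.Adam(probe.parameters(), lr=0.01)
    for _ in range(500):
        opt.zero_grad()
        F.cross_entropy(probe(f), y).backward()
        opt.step()
    acc = (probe(f).argmax(dim=1) == y).float().mean().item()
    print(f"layer {i}: probe accuracy on random labels = {acc:.2f}")
```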
arXiv Detail & Related papers (2022-10-25T13:41:31Z)
- Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models [64.22311189896888]
We study exact memorization in causal and masked language modeling, across model sizes and throughout the training process.
Surprisingly, we show that larger models can memorize a larger portion of the data before overfitting and tend to forget less throughout the training process.
arXiv Detail & Related papers (2022-05-22T07:43:50Z)
- Contrasting random and learned features in deep Bayesian linear regression [12.234742322758418]
We study how the ability to learn affects the generalization performance of a simple class of models.
By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch.
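The random-feature vs. fully-trained contrast can be mimicked, far from the paper's Bayesian treatment, by freezing all hidden layers so that only the readout is trained. A toy sketch, with an arbitrary synthetic task:

```python
# Toy contrast (not the paper's Bayesian analysis): a deep random-feature model
# (hidden layers frozen, readout trained) vs. training all layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(256, 16)
y = (x.sum(dim=1, keepdim=True) > 0).float()

def fit(train_all_layers):
    net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                        nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
    if train_all_layers:
        params = net.parameters()
    else:                                     # random features: readout only
        for p in list(net.parameters())[:-2]:
            p.requires_grad_(False)
        params = net[-1].parameters()
    opt = torch.optim.Adam(params, lr=0.01)
    for _ in range(500):
        opt.zero_grad()
        F.binary_cross_entropy_with_logits(net(x), y).backward()
        opt.step()
    return ((net(x) > 0).float() == y).float().mean().item()

print("random features:", fit(False), "fully trained:", fit(True))
```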
arXiv Detail & Related papers (2022-03-01T15:51:29Z)
- Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
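The underlying measurement can be sketched as a verbatim-continuation check: prompt the model with a prefix drawn from its training data and test whether greedy decoding reproduces the true continuation. A minimal sketch assuming the Hugging Face transformers API; "gpt2" and the token counts are placeholders, and this is the general idea rather than the paper's exact protocol.

```python
# Sketch of a verbatim-memorization check; "gpt2" and the prefix/continuation
# lengths are placeholders, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def is_memorized(training_text, prefix_len=32, cont_len=32):
    ids = tok(training_text, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_len].unsqueeze(0)
    target = ids[prefix_len:prefix_len + cont_len]
    out = model.generate(prefix, max_new_tokens=cont_len, do_sample=False)
    return torch.equal(out[0, prefix_len:], target)  # exact verbatim match?
```

Sweeping such a check across models and prompts is the kind of measurement behind the log-linear relationships described above.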
arXiv Detail & Related papers (2022-02-15T18:48:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.