Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers
- URL: http://arxiv.org/abs/2506.09099v2
- Date: Tue, 17 Jun 2025 19:17:29 GMT
- Title: Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers
- Authors: Joshua Barron, Devin White
- Abstract summary: We investigate the relationship between memorization and generalization in large language models. Small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. Findings suggest that pre-training may intrinsically favor one learning mode over the other.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The relationship between memorization and generalization in large language models (LLMs) remains an open area of research, with growing evidence that the two are deeply intertwined. In this work, we investigate this relationship by pre-training a series of capacity-limited Transformer models from scratch on two synthetic character-level tasks designed to separately probe generalization (via arithmetic extrapolation) and memorization (via factual recall). We observe a consistent trade-off: small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. An intermediate-capacity model exhibits a similar shift toward memorization. When trained on both tasks jointly, no model (regardless of size) succeeds at extrapolation. These findings suggest that pre-training may intrinsically favor one learning mode over the other. By isolating these dynamics in a controlled setting, our study offers insight into how model capacity shapes learning behavior and carries broader implications for the design and deployment of small language models.
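As a rough illustration of the setup described in the abstract, the sketch below generates the two kinds of synthetic character-level data (arithmetic strings for extrapolation, arbitrary entity-attribute pairs for factual recall) and lists a ladder of capacity-limited model configurations. The task formats, operand ranges, and layer/width values are illustrative assumptions, not the authors' exact specification.

```python
# Minimal sketch (assumed formats, not the paper's exact setup): two
# character-level synthetic tasks -- arithmetic extrapolation and factual
# recall -- plus a ladder of capacity-limited Transformer configurations.
import random


def arithmetic_examples(n, lo=0, hi=99, seed=0):
    """Character-level addition strings such as '23+58=81'."""
    rng = random.Random(seed)
    return [
        f"{a}+{b}={a + b}"
        for a, b in ((rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n))
    ]


def factual_recall_examples(n, seed=1):
    """Arbitrary entity-attribute pairs that can only be memorized, not inferred."""
    rng = random.Random(seed)
    return [
        f"entity{i:04d}:" + "".join(rng.choices("abcdefghij", k=6))
        for i in range(n)
    ]


# Placeholder capacity ladder: small models are intended to probe
# generalization, larger ones memorization (parameter counts are ours,
# not the paper's).
MODEL_CONFIGS = {
    "small":  dict(n_layer=2, n_head=2, d_model=64),
    "medium": dict(n_layer=4, n_head=4, d_model=128),
    "large":  dict(n_layer=8, n_head=8, d_model=256),
}

if __name__ == "__main__":
    train_arith = arithmetic_examples(10_000, lo=0, hi=99)           # seen operand range
    test_arith = arithmetic_examples(1_000, lo=100, hi=999, seed=2)  # extrapolation range
    facts = factual_recall_examples(5_000)
    print(train_arith[:3], test_arith[:3], facts[:3], sep="\n")
```

Joint training of the kind studied in the paper would simply interleave the two datasets during pre-training; the reported result is that no configuration, large or small, extrapolates under that regime.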
Related papers
- Quantifying Cross-Modality Memorization in Vision-Language Models [86.82366725590508]
We study the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities.
arXiv Detail & Related papers (2025-06-05T16:10:47Z) - How much do language models memorize? [104.22411031699127]
We separate memorization into two components: "unintended memorization" and "generalization". When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point "grokking" begins, and unintended memorization decreases as models begin to generalize.
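The decomposition described above can be written schematically; the notation below is ours, not the paper's exact formalism:

```latex
\[
\mathrm{mem}_{\text{total}}(\theta, D)
  \;=\;
\underbrace{\mathrm{mem}_{\text{unintended}}(\theta, D)}_{\text{specific to the training set}}
  \;+\;
\underbrace{\mathrm{gen}(\theta)}_{\text{true data distribution}}
\]
```

With generalization driven to zero, the measured total memorization then serves as an estimate of model capacity.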
arXiv Detail & Related papers (2025-05-30T17:34:03Z) - Small Models, Smarter Learning: The Power of Joint Task Training [23.91574901740084]
Our study focuses on the ListOps dataset, which consists of nested mathematical operations. We observe that sum modulo n is the hardest to learn. We show that joint training not only improves performance but also leads to qualitatively different model behavior.
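For readers unfamiliar with ListOps, a minimal generator and evaluator for ListOps-style expressions might look like the sketch below; operator names follow the public ListOps convention (MAX, MIN, MED, and SM for sum modulo 10), which may differ from the exact variant used in the paper.

```python
# Hedged sketch of ListOps-style nested expressions (operators follow the
# public ListOps convention; the paper's variant may differ).
# SM is sum modulo 10 -- the operation reported hardest to learn.
import random

OPS = {
    "MAX": lambda xs: max(xs),
    "MIN": lambda xs: min(xs),
    "MED": lambda xs: sorted(xs)[len(xs) // 2],
    "SM":  lambda xs: sum(xs) % 10,
}


def gen_expr(depth, rng):
    """Build a nested expression string such as '[SM 3 [MAX 1 7] 2]'."""
    if depth == 0:
        return str(rng.randint(0, 9))
    op = rng.choice(sorted(OPS))
    args = [gen_expr(depth - 1, rng) for _ in range(rng.randint(2, 4))]
    return f"[{op} {' '.join(args)}]"


def evaluate(tokens):
    """Recursively evaluate a tokenized expression; returns (value, remaining tokens)."""
    head, *rest = tokens
    if head != "[":
        return int(head), rest
    op, *rest = rest
    vals = []
    while rest[0] != "]":
        v, rest = evaluate(rest)
        vals.append(v)
    return OPS[op](vals), rest[1:]  # drop the closing ']'


if __name__ == "__main__":
    rng = random.Random(0)
    expr = gen_expr(2, rng)
    value, _ = evaluate(expr.replace("[", "[ ").replace("]", " ]").split())
    print(expr, "->", value)
```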
arXiv Detail & Related papers (2025-05-23T20:56:37Z) - Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
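For readers unfamiliar with the design, the textbook difference-in-differences estimator is shown below, written with the model's log-likelihood of an instance x as the outcome; the notation is ours, and the paper's concrete estimator may differ in detail.

```latex
\[
\widehat{\mathrm{mem}}(x) \;=\;
\Bigl(\ell^{\text{trained on }x}_{\text{post}}(x) - \ell^{\text{trained on }x}_{\text{pre}}(x)\Bigr)
\;-\;
\Bigl(\ell^{\text{not trained on }x}_{\text{post}}(x) - \ell^{\text{not trained on }x}_{\text{pre}}(x)\Bigr)
\]
```

That is, memorisation is estimated as how much more the model's fit to x improves over a training interval when x is included in the training data than when it is not.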
arXiv Detail & Related papers (2024-06-06T17:59:09Z) - Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times [15.738530737312335]
Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades.
This paper presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends.
The results indicate that Transformer-based language models' surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.
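Here surprisal is the standard information-theoretic quantity computed from the model's conditional probabilities:

```latex
\[
s(w_t) \;=\; -\log_2 p_\theta\!\left(w_t \mid w_{<t}\right)
\]
```

Under this measure, a model that predicts rare words far better than humans do assigns those words unusually low surprisal, which is consistent with the divergence from human reading times described above.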
arXiv Detail & Related papers (2024-02-03T20:22:54Z) - Turning large language models into cognitive models [0.0]
We show that large language models can be turned into cognitive models.
These models offer accurate representations of human behavior, even outperforming traditional cognitive models in two decision-making domains.
Taken together, these results suggest that large, pre-trained models can be adapted to become generalist cognitive models.
arXiv Detail & Related papers (2023-06-06T18:00:01Z) - Emergent and Predictable Memorization in Large Language Models [23.567027014457775]
Memorization, or the tendency of large language models to output entire sequences from their training data verbatim, is a key concern for safely deploying language models.
We seek to predict which sequences will be memorized before a large model's full train-time by extrapolating the memorization behavior of lower-compute trial runs.
We provide further novel discoveries on the distribution of memorization scores across models and data.
arXiv Detail & Related papers (2023-04-21T17:58:31Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
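Schematically, a log-linear relationship of this kind says the memorized fraction grows roughly linearly in the logarithm of the driving factor:

```latex
\[
\text{memorized fraction} \;\approx\; \alpha \log x + \beta
\]
```

where x stands for the quantity being varied; in the full paper the three factors are model scale, how often a sequence is duplicated in the training data, and the amount of context used to prompt the model.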
arXiv Detail & Related papers (2022-02-15T18:48:31Z)