How do language models learn facts? Dynamics, curricula and hallucinations
- URL: http://arxiv.org/abs/2503.21676v1
- Date: Thu, 27 Mar 2025 16:43:45 GMT
- Title: How do language models learn facts? Dynamics, curricula and hallucinations
- Authors: Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De,
- Abstract summary: Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood.<n>This work investigates the learning dynamics of language models on a synthetic factual recall task.
- Score: 22.693703460345873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.
Related papers
- Spurious Forgetting in Continual Learning of Language Models [20.0936011355535]
Recent advancements in large language models (LLMs) reveal a perplexing phenomenon in continual learning.<n>Despite extensive training, models experience significant performance declines.<n>This study proposes that such performance drops often reflect a decline in task alignment rather than true knowledge loss.
arXiv Detail & Related papers (2025-01-23T08:09:54Z) - Developmental Predictive Coding Model for Early Infancy Mono and Bilingual Vocal Continual Learning [69.8008228833895]
We propose a small-sized generative neural network equipped with a continual learning mechanism.<n>Our model prioritizes interpretability and demonstrates the advantages of online learning.
arXiv Detail & Related papers (2024-12-23T10:23:47Z) - Gradual Learning: Optimizing Fine-Tuning with Partially Mastered Knowledge in Large Language Models [51.20499954955646]
Large language models (LLMs) acquire vast amounts of knowledge from extensive text corpora during the pretraining phase.
In later stages such as fine-tuning and inference, the model may encounter knowledge not covered in the initial training.
We propose a two-stage fine-tuning strategy to improve the model's overall test accuracy and knowledge retention.
arXiv Detail & Related papers (2024-10-08T08:35:16Z) - The mechanistic basis of data dependence and abrupt learning in an
in-context classification task [0.3626013617212666]
We show that specific distributional properties inherent in language control the trade-off or simultaneous appearance of two forms of learning.
In-context learning is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning.
We propose that the sharp transitions in attention-based networks arise due to a specific chain of multi-layer operations necessary to achieve ICL.
arXiv Detail & Related papers (2023-12-03T20:53:41Z) - Retentive or Forgetful? Diving into the Knowledge Memorizing Mechanism
of Language Models [49.39276272693035]
Large-scale pre-trained language models have shown remarkable memorizing ability.
Vanilla neural networks without pre-training have been long observed suffering from the catastrophic forgetting problem.
We find that 1) Vanilla language models are forgetful; 2) Pre-training leads to retentive language models; 3) Knowledge relevance and diversification significantly influence the memory formation.
arXiv Detail & Related papers (2023-05-16T03:50:38Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP)
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains under explored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Anti-Retroactive Interference for Lifelong Learning [65.50683752919089]
We design a paradigm for lifelong learning based on meta-learning and associative mechanism of the brain.
It tackles the problem from two aspects: extracting knowledge and memorizing knowledge.
It is theoretically analyzed that the proposed learning paradigm can make the models of different tasks converge to the same optimum.
arXiv Detail & Related papers (2022-08-27T09:27:36Z) - DKPLM: Decomposable Knowledge-enhanced Pre-trained Language Model for
Natural Language Understanding [19.478288026844893]
Knowledge-Enhanced Pre-trained Language Models (KEPLMs) are pre-trained models with relation triples injecting from knowledge graphs to improve language understanding abilities.
Previous studies integrate models with knowledge encoders for representing knowledge retrieved from knowledge graphs.
We propose a novel KEPLM named DKPLM that Decomposes Knowledge injection process of the Pre-trained Language Models in pre-training, fine-tuning and inference stages.
arXiv Detail & Related papers (2021-12-02T08:19:42Z) - Training Dynamics for Text Summarization Models [45.62439188988816]
We analyze the training dynamics for generation models, focusing on news summarization.
Across different datasets (CNN/DM, XSum, MediaSum) and summary properties, we study what the model learns at different stages of its fine-tuning process.
We find that properties such as copy behavior are learnt earlier in the training process and these observations are robust across domains.
On the other hand, factual errors, such as hallucination of unsupported facts, are learnt in the later stages, and this behavior is more varied across domains.
arXiv Detail & Related papers (2021-10-15T21:13:41Z) - The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.