Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge
- URL: http://arxiv.org/abs/2505.16178v1
- Date: Thu, 22 May 2025 03:34:29 GMT
- Title: Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge
- Authors: Ying Zhang, Benjamin Heinzerling, Dongyuan Li, Ryoma Ishigaki, Yuta Hitomi, Kentaro Inui
- Abstract summary: We investigate how training strategies affect how model parameters are shaped during training and how these differences relate to their ability to recall facts. Our analysis on synthetic fact recall datasets with the Llama-3.2B and Pythia-2.8B models reveals that mixed training encourages a larger and more centralized set of shared parameters.
- Score: 21.798525556259378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fact recall, the ability of language models (LMs) to retrieve specific factual knowledge, remains a challenging task despite their impressive general capabilities. Common training strategies often struggle to promote robust recall behavior: two-stage training, which first trains a model on fact-storing examples (e.g., factual statements) and then on fact-recalling examples (question-answer pairs), tends to encourage rote memorization rather than generalizable fact retrieval. In contrast, mixed training, which jointly uses both types of examples, has been empirically shown to improve the ability to recall facts, but the underlying mechanisms remain poorly understood. In this work, we investigate how these training strategies shape model parameters during training and how these differences relate to fact recall ability. We introduce the cross-task gradient trace to identify shared parameters: those strongly influenced by both fact-storing and fact-recalling examples. Our analysis on synthetic fact recall datasets with the Llama-3.2B and Pythia-2.8B models reveals that mixed training encourages a larger and more centralized set of shared parameters. These findings suggest that the emergence of shared parameters may play a key role in enabling LMs to generalize factual knowledge across task formulations.
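The abstract names the cross-task gradient trace without spelling out its computation. As a rough sketch of the underlying idea, one could accumulate per-parameter gradient magnitudes separately over fact-storing and fact-recalling batches and call a parameter "shared" when it ranks in the top fraction under both traces. The function names, the top-fraction threshold, and the intersection criterion below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def gradient_trace(model, loss_fn, batches):
    """Accumulate per-parameter gradient magnitudes over a set of batches."""
    trace = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                trace[name] += p.grad.abs()
    return trace

def shared_parameters(trace_store, trace_recall, top_frac=0.01):
    """Indices in the top fraction of accumulated gradient magnitude under BOTH tasks."""
    shared = {}
    for name in trace_store:
        flat_s = trace_store[name].flatten()
        flat_r = trace_recall[name].flatten()
        k = max(1, int(top_frac * flat_s.numel()))
        top_s = set(torch.topk(flat_s, k).indices.tolist())
        top_r = set(torch.topk(flat_r, k).indices.tolist())
        shared[name] = top_s & top_r
    return shared
```

Under this reading, a "larger and more centralized" set of shared parameters would correspond to more surviving indices in the intersection, concentrated in fewer layers.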
Related papers
- Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers [0.0]
We investigate the relationship between memorization and generalization in large language models. Small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. Findings suggest that pre-training may intrinsically favor one learning mode over the other.
arXiv Detail & Related papers (2025-06-10T14:49:33Z)
- How do language models learn facts? Dynamics, curricula and hallucinations [22.693703460345873]
Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task.
arXiv Detail & Related papers (2025-03-27T16:43:45Z)
- Self-supervised Analogical Learning using Language Models [59.64260218737556]
We propose SAL, a self-supervised analogical learning framework. SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions. We show that the resulting models outperform base language models on a wide range of reasoning benchmarks.
arXiv Detail & Related papers (2025-02-03T02:31:26Z)
- Disentangling Memory and Reasoning Ability in Large Language Models [97.26827060106581]
We propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions. Our experimental results show that this decomposition improves model performance and enhances the interpretability of the inference process.
arXiv Detail & Related papers (2024-11-20T17:55:38Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
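The summary does not define the metric precisely; one loose reading, assumed here, is the model's accuracy on each training example averaged over the epochs before it first reproduces that example's target verbatim, then averaged over examples:

```python
def pre_memorization_train_accuracy(correct, memorized):
    """Average train accuracy measured before each example is memorized.

    correct[i][t]   -- True if the model answered train example i correctly at epoch t.
    memorized[i][t] -- True if the model reproduces example i's target verbatim at epoch t.

    This is an assumed operationalization, not necessarily the paper's exact definition.
    """
    scores = []
    for c, m in zip(correct, memorized):
        # Epochs strictly before the first epoch at which the example is memorized.
        cutoff = next((t for t, flag in enumerate(m) if flag), len(m))
        if cutoff > 0:
            scores.append(sum(c[:cutoff]) / cutoff)
    return sum(scores) / len(scores) if scores else 0.0
```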
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Co-occurrence is not Factual Association in Language Models [19.708303468664088]
We show that language models are biased to learn word co-occurrence statistics instead of true factual associations.
We propose two strategies to improve the learning of factual associations in language models.
arXiv Detail & Related papers (2024-09-21T08:13:16Z)
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
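The difference-in-differences design itself is standard: the change in performance on an instance for models that trained on it, minus the same change for models that did not, isolates the causal effect of that training from general learning progress. A minimal sketch with illustrative variable names:

```python
def did_memorisation(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences estimate of memorisation for one instance.

    treated_*: mean performance (e.g., log-likelihood of the instance) for models
               that train on it, measured before and after the training step.
    control_*: the same quantities for models that never see the instance.
    Subtracting the control trend removes improvement due to general learning,
    leaving the causal effect of training on the instance itself.
    """
    return (treated_after - treated_before) - (control_after - control_before)
```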
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning [113.58691755215663]
We develop RetroPrompt to help a model strike a balance between generalization and memorization.
In contrast with vanilla prompt learning, RetroPrompt constructs an open-book knowledge-store from training instances.
Extensive experiments demonstrate that RetroPrompt can obtain better performance in both few-shot and zero-shot settings.
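As a minimal illustration of the open-book idea (not RetroPrompt's actual architecture), the knowledge-store can be pictured as a nearest-neighbour index over embedded training instances whose retrievals are prepended to the prompt; the class and function names below are hypothetical:

```python
import numpy as np

class KnowledgeStore:
    """Toy open-book store: embed training instances, retrieve by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed  # callable: str -> np.ndarray
        self.texts, self.vecs = [], []

    def add(self, text):
        v = self.embed(text)
        self.texts.append(text)
        self.vecs.append(v / np.linalg.norm(v))

    def retrieve(self, query, k=3):
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vecs) @ q
        return [self.texts[i] for i in np.argsort(-sims)[:k]]

def augment_prompt(store, query):
    # Prepend retrieved training instances as in-context evidence.
    return "\n".join(store.retrieve(query)) + "\n" + query
```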
arXiv Detail & Related papers (2022-05-29T16:07:30Z)
- Imbalanced Adversarial Training with Reweighting [33.51820466479575]
We show that adversarially trained models can suffer much worse performance on under-represented classes when the training dataset is imbalanced.
Traditional reweighting strategies may lose their efficacy in dealing with the imbalance issue in adversarial training.
We propose Separable Reweighted Adversarial Training (SRAT) to facilitate adversarial training under imbalanced scenarios.
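SRAT's separable reweighting scheme is not detailed in this summary; the sketch below shows only the generic shape of reweighted adversarial training, with per-class weights as an assumed stand-in for the paper's scheme:

```python
import torch
import torch.nn.functional as F

def reweighted_adversarial_step(model, x, y, class_weights,
                                eps=8/255, alpha=2/255, steps=10):
    """One PGD adversarial training step with per-class example weights.

    class_weights: tensor of shape [num_classes] that up-weights rare classes
    so the adversarial loss does not ignore them on imbalanced data. This is
    generic reweighting, not SRAT's specific separable scheme.
    """
    # Inner maximization: PGD from a random start within the eps-ball.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    # Outer minimization: weight each example's adversarial loss by its class.
    per_example = F.cross_entropy(model(x + delta.detach()), y, reduction="none")
    return (class_weights[y] * per_example).mean()
```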
arXiv Detail & Related papers (2021-07-28T20:51:36Z)
- Which Mutual-Information Representation Learning Objectives are Sufficient for Control? [80.2534918595143]
Mutual information provides an appealing formalism for learning representations of data.
This paper formalizes the sufficiency of a state representation for learning and representing the optimal policy.
Surprisingly, we find that two of these mutual-information objectives can yield insufficient representations under mild and common assumptions on the structure of the MDP.
arXiv Detail & Related papers (2021-06-14T10:12:34Z)
- Low-Resource Knowledge-Grounded Dialogue Generation [74.09352261943913]
We consider knowledge-grounded dialogue generation under a natural assumption that only limited training examples are available.
We devise a disentangled response decoder in order to isolate parameters that depend on knowledge-grounded dialogues from the entire generation model.
With only 1/8 of the training data, our model achieves state-of-the-art performance and generalizes well to out-of-domain knowledge.
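The summary indicates parameter isolation rather than a specific layout; one hypothetical way to realize it is to confine knowledge-dependent computation to a small dedicated module, so that only those parameters need the scarce grounded data, as in this toy sketch:

```python
import torch.nn as nn

class DisentangledDecoder(nn.Module):
    """Toy split of decoder parameters into language vs knowledge parts.

    Illustrative only: the language layer can be trained on plentiful,
    ungrounded dialogue data, while the knowledge-attention module is the
    only part updated on the scarce knowledge-grounded dialogues.
    """

    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.language_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.knowledge_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, tgt, context, knowledge):
        h = self.language_layer(tgt, context)
        k, _ = self.knowledge_attn(h, knowledge, knowledge)
        return h + k

def grounded_parameters(decoder):
    # Optimizer group for low-resource training: knowledge parameters only.
    return decoder.knowledge_attn.parameters()
```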
arXiv Detail & Related papers (2020-02-24T16:20:32Z)