How Do Large Language Models Acquire Factual Knowledge During Pretraining?
- URL: http://arxiv.org/abs/2406.11813v1
- Date: Mon, 17 Jun 2024 17:54:40 GMT
- Title: How Do Large Language Models Acquire Factual Knowledge During Pretraining?
- Authors: Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo
- Abstract summary: We study how large language models (LLMs) acquire factual knowledge during pretraining.
Findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining.
- Score: 36.59608982935844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the recent observation that large language models (LLMs) can store substantial factual knowledge, there is limited understanding of how they acquire that knowledge through pretraining. This work addresses the gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition. First, counterintuitively, we observe that pretraining on more data shows no significant improvement in the model's capability to acquire and maintain factual knowledge. Second, there is a power-law relationship between training steps and the forgetting of both memorized and generalized factual knowledge, and LLMs trained with duplicated training data exhibit faster forgetting. Third, training LLMs with larger batch sizes enhances their robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs by progressively increasing the probability of the factual knowledge presented in the pretraining data at each step, an increase that is then diluted by subsequent forgetting. Based on this interpretation, we provide plausible explanations for recently observed behaviors of LLMs, such as their poor performance on long-tail knowledge and the benefits of deduplicating the pretraining corpus.
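The abstract's interpretation, in which each exposure boosts a fact's probability and the boost then decays as a power law of subsequent training steps, can be sketched numerically. This is a minimal illustration, not the paper's model: the decay exponents and the specific functional form are hypothetical parameters chosen only to show how a larger exponent (as reported for duplicated data) yields faster forgetting.

```python
def knowledge_retention(steps_since_exposure, boost=1.0, decay_exp=0.3):
    """Power-law forgetting: the probability boost a fact receives at
    exposure decays as (t + 1) ** -decay_exp over subsequent steps.
    boost and decay_exp are illustrative parameters, not values
    reported in the paper."""
    return boost * (steps_since_exposure + 1) ** -decay_exp

# The paper reports faster forgetting with duplicated training data,
# modeled here (hypothetically) as a larger decay exponent.
steps = (0, 10, 100, 1000)
unique = [knowledge_retention(t, decay_exp=0.2) for t in steps]
duplicated = [knowledge_retention(t, decay_exp=0.5) for t in steps]

# After the initial exposure (t = 0, where both equal the full boost),
# the high-exponent curve retains strictly less at every later step.
assert all(u > d for u, d in zip(unique[1:], duplicated[1:]))
```

Under this toy picture, long-tail facts are re-exposed rarely, so the power-law decay dominates between exposures, which is consistent with the abstract's explanation for poor long-tail performance.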
Related papers
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We investigate the interplay between generalization and memorization in large language models at scale.
With various sizes of open-source LLMs and their pretraining corpora, we observe that as the model size increases, the task-relevant $n$-gram pair data becomes increasingly important.
Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization with sufficient task-related pretraining data.
arXiv Detail & Related papers (2024-07-20T21:24:40Z) - Source-Aware Training Enables Knowledge Attribution in Language Models [81.13048060332775]
Large language models (LLMs) learn a vast amount of knowledge during pretraining, but they are often oblivious to the source(s) of such knowledge.
We investigate the problem of intrinsic source citation, where LLMs are required to cite the pretraining source supporting a generated response.
Our training recipe can enable faithful attribution to the pretraining data without a substantial impact on the model's quality compared to standard pretraining.
arXiv Detail & Related papers (2024-04-01T09:39:38Z) - Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models [49.48324619809122]
We pioneer the exploration of LLM's trustworthiness during pre-training.
We focus on five key dimensions: reliability, privacy, toxicity, fairness, and robustness.
We are the first to observe a similar two-phase phenomenon: fitting and compression.
arXiv Detail & Related papers (2024-02-29T18:55:06Z) - An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information while acquiring new knowledge.
This study empirically evaluates the forgetting phenomenon in large language models (LLMs) during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z) - Measuring and Modifying Factual Knowledge in Large Language Models [2.8427946758947304]
Large Language Models store an extensive amount of factual knowledge obtained from vast collections of text.
We employ information theory-based measurements to provide a framework estimating the factual knowledge contained within large language models.
arXiv Detail & Related papers (2023-06-09T21:25:48Z) - Knowledge Inheritance for Pre-trained Language Models [57.51305807391381]
We introduce a novel pre-training framework named "knowledge inheritance" (KI).
KI combines both self-learning and teacher-guided learning to efficiently train larger PLMs.
We show that KI can well support lifelong learning and knowledge transfer.
arXiv Detail & Related papers (2021-05-28T14:43:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.