How Do Large Language Models Acquire Factual Knowledge During Pretraining?
- URL: http://arxiv.org/abs/2406.11813v1
- Date: Mon, 17 Jun 2024 17:54:40 GMT
- Title: How Do Large Language Models Acquire Factual Knowledge During Pretraining?
- Authors: Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo
- Abstract summary: We study how large language models (LLMs) acquire factual knowledge during pretraining.
Findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining.
- Score: 36.59608982935844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the recent observation that large language models (LLMs) can store substantial factual knowledge, there is limited understanding of how they acquire that knowledge through pretraining. This work addresses the gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition. First, counterintuitively, we observe that pretraining on more data shows no significant improvement in the model's capability to acquire and maintain factual knowledge. Second, there is a power-law relationship between training steps and the forgetting of both memorized and generalized factual knowledge, and LLMs trained with duplicated training data exhibit faster forgetting. Third, training LLMs with larger batch sizes enhances their robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs by progressively increasing the probability of the factual knowledge presented in the pretraining data at each step, an increase that is then diluted by subsequent forgetting. Based on this interpretation, we provide plausible explanations for recently observed behaviors of LLMs, such as their poor performance on long-tail knowledge and the benefits of deduplicating the pretraining corpus.
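The abstract's interpretation, in which each exposure boosts a fact's probability and the boost then decays as a power law of subsequent training steps, can be sketched numerically. This is a minimal illustration, not the paper's model: the decay exponents and the specific functional form are hypothetical parameters chosen only to show how a larger exponent (as reported for duplicated data) yields faster forgetting.

```python
def knowledge_retention(steps_since_exposure, boost=1.0, decay_exp=0.3):
    """Power-law forgetting: the probability boost a fact receives at
    exposure decays as (t + 1) ** -decay_exp over subsequent steps.
    boost and decay_exp are illustrative parameters, not values
    reported in the paper."""
    return boost * (steps_since_exposure + 1) ** -decay_exp

# The paper reports faster forgetting with duplicated training data,
# modeled here (hypothetically) as a larger decay exponent.
steps = (0, 10, 100, 1000)
unique = [knowledge_retention(t, decay_exp=0.2) for t in steps]
duplicated = [knowledge_retention(t, decay_exp=0.5) for t in steps]

# After the initial exposure (t = 0, where both equal the full boost),
# the high-exponent curve retains strictly less at every later step.
assert all(u > d for u, d in zip(unique[1:], duplicated[1:]))
```

Under this toy picture, long-tail facts are re-exposed rarely, so the power-law decay dominates between exposures, which is consistent with the abstract's explanation for poor long-tail performance.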
Related papers
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We investigate the interplay between generalization and memorization in large language models at scale.
With various sizes of open-source LLMs and their pretraining corpora, we observe that as the model size increases, the task-relevant $n$-gram pair data becomes increasingly important.
Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization with sufficient task-related pretraining data.
arXiv Detail & Related papers (2024-07-20T21:24:40Z) - Source-Aware Training Enables Knowledge Attribution in Language Models [81.13048060332775]
Large language models (LLMs) learn a vast amount of knowledge during pretraining, but they are often oblivious to the source(s) of such knowledge.
We investigate the problem of intrinsic source citation, where LLMs are required to cite the pretraining source supporting a generated response.
Our training recipe can enable faithful attribution to the pretraining data without a substantial impact on the model's quality compared to standard pretraining.
arXiv Detail & Related papers (2024-04-01T09:39:38Z) - Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models [49.48324619809122]
We pioneer the exploration of LLM's trustworthiness during pre-training.
We focus on five key dimensions: reliability, privacy, toxicity, fairness, and robustness.
We are the first to observe a similar two-phase phenomenon: fitting and compression.
arXiv Detail & Related papers (2024-02-29T18:55:06Z) - An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information while acquiring new knowledge.
This study empirically evaluates the forgetting phenomenon in large language models (LLMs) during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z) - Measuring and Modifying Factual Knowledge in Large Language Models [2.8427946758947304]
Large Language Models store an extensive amount of factual knowledge obtained from vast collections of text.
We employ information theory-based measurements to provide a framework estimating the factual knowledge contained within large language models.
arXiv Detail & Related papers (2023-06-09T21:25:48Z) - Knowledge Inheritance for Pre-trained Language Models [57.51305807391381]
We introduce a novel pre-training framework named "knowledge inheritance" (KI).
KI combines both self-learning and teacher-guided learning to efficiently train larger PLMs.
We show that KI can well support lifelong learning and knowledge transfer.
arXiv Detail & Related papers (2021-05-28T14:43:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.