How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models
- URL: http://arxiv.org/abs/2509.19371v1
- Date: Fri, 19 Sep 2025 07:46:10 GMT
- Title: How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models
- Authors: Kangtao Lv, Haibin Chen, Yujin Yuan, Langming Liu, Shilei Liu, Yongwei Wang, Wenbo Su, Bo Zheng
- Abstract summary: Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. We propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs.
- Score: 17.129300781943655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. However, without domain-specific optimization, they often underperform on specialized knowledge benchmarks and even produce hallucinations. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. A critical challenge lies in balancing this infusion trade-off: injecting too little domain-specific data yields insufficient specialization, whereas excessive infusion triggers catastrophic forgetting of previously acquired knowledge. In this work, we focus on the phenomenon of memory collapse induced by over-infusion. Through systematic experiments, we make two key observations: (1) Critical collapse point: each model exhibits a threshold beyond which its knowledge retention capabilities sharply degrade. (2) Scale correlation: these collapse points scale consistently with the model's size. Building on these insights, we propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs by analyzing their smaller counterparts. Extensive experiments across different model sizes and pretraining token budgets validate both the effectiveness and generalizability of our scaling law.
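The abstract does not state the law's functional form, so the sketch below is only a rough illustration: it assumes the collapse point follows a power law in model size and extrapolates from small proxy models. All sizes and token counts are invented.

```python
# Hypothetical illustration of the paper's core idea: fit how the "collapse
# point" (domain tokens beyond which retention degrades) scales with model
# size on small models, then extrapolate to a larger target model. The
# power-law form and every number below are assumptions, not paper values.
import numpy as np

# Assumed measurements: (parameter count, observed collapse point in tokens)
sizes = np.array([125e6, 350e6, 1.3e9])            # small proxy models
collapse_tokens = np.array([0.8e9, 2.1e9, 7.5e9])  # made-up collapse points

# Fit log(collapse) = log(a) + b * log(size)  =>  collapse ~ a * size**b
b, log_a = np.polyfit(np.log(sizes), np.log(collapse_tokens), deg=1)
a = np.exp(log_a)

target_size = 7e9  # larger model we cannot afford to sweep directly
predicted = a * target_size ** b
print(f"Predicted safe infusion budget for a 7B model: {predicted:.3e} tokens")
```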
Related papers
- CAT: Causal Attention Tuning For Injecting Fine-grained Causal Knowledge into Large Language Models [42.12079243701232]
Causal Attention Tuning (CAT) is a novel approach that injects fine-grained causal knowledge into the attention mechanism. We propose an automated pipeline that leverages human priors to automatically generate token-level causal signals. CAT achieves an average improvement of 5.76% on the STG dataset and 1.56% on downstream tasks.
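As a loose illustration of attention-level injection (not CAT's actual recipe), one could bias attention logits with an external token-pair causal signal:

```python
# Minimal sketch of biasing attention toward token pairs flagged by an
# external causal signal. The mask construction and weighting are
# assumptions for illustration, not the method from the paper.
import torch
import torch.nn.functional as F

def causally_biased_attention(q, k, v, causal_prior, alpha=1.0):
    """q, k, v: (batch, heads, seq, dim); causal_prior: (seq, seq) in [0, 1]."""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    logits = logits + alpha * causal_prior  # boost causally linked token pairs
    return F.softmax(logits, dim=-1) @ v
```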
arXiv Detail & Related papers (2025-09-01T15:13:15Z) - Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models [83.24079543652253]
Large language models (LLMs) have advanced significantly in reasoning tasks through reinforcement learning (RL) optimization. However, reasoning-oriented RL fine-tuning markedly increases the prevalence of hallucinations. We propose Factuality-aware Step-wise Policy Optimization (FSPO), an RL fine-tuning algorithm incorporating explicit factuality verification.
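A hedged sketch of what a step-wise, factuality-aware reward might look like; the verifier and penalty weights are placeholders, not FSPO's actual design:

```python
# Illustrative reward in the spirit of factuality-aware RL: each reasoning
# step is checked by a verifier and unsupported steps are penalized.
# `verify_step` is a placeholder (e.g., a retrieval- or NLI-based check).
def factuality_aware_reward(steps, final_answer_correct, verify_step,
                            step_penalty=0.5, answer_reward=1.0):
    reward = answer_reward if final_answer_correct else 0.0
    for step in steps:
        if not verify_step(step):   # unsupported claim in this step
            reward -= step_penalty  # discourage hallucinated reasoning
    return reward
```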
arXiv Detail & Related papers (2025-05-30T14:23:32Z) - Rethinking the Outlier Distribution in Large Language Models: An In-depth Study [4.740962650068888]
Outliers often cause considerable quantization errors, leading to degraded model performance. Recent studies have identified two common types of outliers in large language models: massive activations and channel-wise outliers.
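To make the two outlier types concrete, here is an illustrative detector (not from the paper) for individual extreme values and for whole channels of unusually large magnitude; the thresholds are assumptions:

```python
import torch

def find_outliers(acts, massive_thresh=1e2, channel_sigma=6.0):
    """acts: (tokens, channels) activation matrix from one layer."""
    # "Massive activations": individual values far above a magnitude threshold.
    massive = (acts.abs() > massive_thresh).nonzero()
    # "Channel-wise outliers": channels whose mean magnitude is a z-score
    # outlier relative to the other channels.
    ch_norm = acts.abs().mean(dim=0)
    z = (ch_norm - ch_norm.mean()) / ch_norm.std()
    outlier_channels = (z > channel_sigma).nonzero().squeeze(-1)
    return massive, outlier_channels
```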
arXiv Detail & Related papers (2025-05-27T18:48:40Z) - The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs [40.35884943268004]
We show that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments.
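As one generic "unit of sparsification", a top-k sparse attention sketch (one common variant, not necessarily among the specific patterns benchmarked in the paper):

```python
# Each query attends only to its k highest-scoring keys; the rest are masked.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep=64):
    """q, k, v: (batch, heads, seq, dim)."""
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    k_keep = min(k_keep, logits.size(-1))
    thresh = logits.topk(k_keep, dim=-1).values[..., -1:]  # k-th largest score
    logits = logits.masked_fill(logits < thresh, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```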
arXiv Detail & Related papers (2025-04-24T17:39:25Z) - Scaling Laws for Data-Efficient Visual Transfer Learning [14.114908296325277]
This paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning. We propose the distillation boundary theory, revealing a critical turning point in distillation efficiency. This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation.
arXiv Detail & Related papers (2025-04-17T07:01:01Z) - Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time [73.22651918134808]
This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in language models (LMs). We pretrain LMs from scratch on a synthetic implicit multi-hop reasoning environment designed to replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning and can be viewed as a simplification of implicit reasoning during real-world pretraining.
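A toy version of this setup, with an invented composition rule, showing how a held-out edge can be recovered by composing observed edges (the rule and entities are illustrative only):

```python
# Tiny synthetic knowledge graph: hold out an edge, then check whether it is
# recoverable by composing two observed relations (2-hop reasoning).
triples = {("a", "parent_of", "b"), ("b", "parent_of", "c")}
rule = ("parent_of", "parent_of", "grandparent_of")  # r1 o r2 => r_out

def infer_missing_edges(observed, rule):
    r1, r2, r_out = rule
    inferred = set()
    for (h1, r, t1) in observed:
        if r != r1:
            continue
        for (h2, r_, t2) in observed:
            if r_ == r2 and h2 == t1:
                inferred.add((h1, r_out, t2))
    return inferred

print(infer_missing_edges(triples, rule))  # {('a', 'grandparent_of', 'c')}
```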
arXiv Detail & Related papers (2025-04-04T17:57:22Z) - Can Large Language Models Help Experimental Design for Causal Discovery? [94.66802142727883]
Large Language Model Guided Intervention Targeting (LeGIT) is a robust framework that effectively incorporates LLMs to augment existing numerical approaches for intervention targeting in causal discovery. LeGIT demonstrates significant improvements and robustness over existing methods and even surpasses humans.
arXiv Detail & Related papers (2025-03-03T03:43:05Z) - Gradual Learning: Optimizing Fine-Tuning with Partially Mastered Knowledge in Large Language Models [51.20499954955646]
Large language models (LLMs) acquire vast amounts of knowledge from extensive text corpora during the pretraining phase.
In later stages such as fine-tuning and inference, the model may encounter knowledge not covered in the initial training.
We propose a two-stage fine-tuning strategy to improve the model's overall test accuracy and knowledge retention.
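One plausible reading of such a two-stage strategy, sketched with hypothetical helpers; the mastery criterion and staging order are assumptions, not the paper's exact procedure:

```python
def classify_by_mastery(dataset, count_correct, n_samples=5):
    """count_correct(item) -> correct answers out of n_samples model attempts."""
    mastered, partial, unknown = [], [], []
    for item in dataset:
        correct = count_correct(item)
        if correct == n_samples:
            mastered.append(item)   # already known; little fine-tuning needed
        elif correct > 0:
            partial.append(item)    # partially mastered: stage-1 target
        else:
            unknown.append(item)    # novel knowledge: stage-2 target
    return mastered, partial, unknown

# Stage 1: fine-tune on `partial` to consolidate shaky knowledge.
# Stage 2: fine-tune on `unknown` to inject genuinely new knowledge.
```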
arXiv Detail & Related papers (2024-10-08T08:35:16Z) - What Matters When Repurposing Diffusion Models for General Dense Perception Tasks? [49.84679952948808]
Recent works show promising results from simply fine-tuning T2I diffusion models for dense perception tasks. We conduct a thorough investigation into the critical factors that affect transfer efficiency and performance when using diffusion priors. Our work culminates in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks.
arXiv Detail & Related papers (2024-03-10T04:23:24Z) - Dissecting Deep RL with High Update Ratios: Combatting Value Divergence [21.282292112642747]
We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters.
We employ a simple unit-ball normalization that enables learning under large update ratios.
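Unit-ball normalization is simple enough to show directly; a common implementation (its placement in the network and the epsilon are assumptions) projects features onto the unit sphere so their magnitude cannot diverge under many gradient updates per environment step:

```python
import torch

def unit_ball_normalize(features, eps=1e-8):
    """features: (batch, dim), e.g., penultimate-layer outputs of a value net."""
    return features / (features.norm(dim=-1, keepdim=True) + eps)
```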
arXiv Detail & Related papers (2024-03-09T19:56:40Z) - Discovery of the Hidden World with Large Language Models [95.58823685009727]
This paper presents Causal representatiOn AssistanT (COAT) that introduces large language models (LLMs) to bridge the gap.
LLMs are trained on massive observations of the world and have demonstrated great capability in extracting key information from unstructured data.
COAT also adopts causal discovery (CD) methods to find causal relations among the identified variables, as well as to provide feedback to LLMs to iteratively refine the proposed factors.
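The iterative LLM-plus-causal-discovery loop can be sketched schematically; every name below is a placeholder rather than the COAT API:

```python
# Schematic loop: an LLM proposes candidate factors from raw observations,
# a causal discovery (CD) routine relates them, and its feedback steers the
# next proposal round.
def coat_loop(raw_data, llm_propose, run_causal_discovery, n_rounds=3):
    factors, feedback, graph = [], None, None
    for _ in range(n_rounds):
        factors = llm_propose(raw_data, prior_factors=factors, feedback=feedback)
        graph, feedback = run_causal_discovery(raw_data, factors)
    return factors, graph
```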
arXiv Detail & Related papers (2024-02-06T12:18:54Z) - On Memorization in Diffusion Models [44.031805633114985]
We show that memorization behaviors tend to occur on smaller-sized datasets. We quantify the impact of the influential factors on these memorization behaviors in terms of effective model memorization (EMM). Our study holds practical significance for diffusion model users and offers clues to theoretical research in deep generative models.
arXiv Detail & Related papers (2023-10-04T09:04:20Z) - An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information. This study empirically evaluates the forgetting phenomenon in large language models during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z)