Enhancing LLMs via High-Knowledge Data Selection
- URL: http://arxiv.org/abs/2505.14070v1
- Date: Tue, 20 May 2025 08:21:37 GMT
- Title: Enhancing LLMs via High-Knowledge Data Selection
- Authors: Feiyu Duan, Xuemiao Zhang, Sirui Wang, Haoran Que, Yuqi Liu, Wenge Rong, Xunliang Cai
- Abstract summary: The performance of Large Language Models (LLMs) is intrinsically linked to the quality of their training data. We propose a novel and gradient-free High-Knowledge Scorer (HKS) to select high-quality data along the knowledge dimension. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves the model's performance on knowledge-intensive and general comprehension tasks.
- Score: 13.769398867340296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of Large Language Models (LLMs) is intrinsically linked to the quality of their training data. Although several studies have proposed methods for high-quality data selection, they do not consider the importance of knowledge richness in text corpora. In this paper, we propose a novel and gradient-free High-Knowledge Scorer (HKS) to select high-quality data along the knowledge dimension, alleviating the problem of knowledge scarcity in the pretraining corpus. We construct a comprehensive multi-domain knowledge element pool and introduce knowledge density and coverage as metrics to assess the knowledge content of a text. Based on these, we propose a comprehensive knowledge scorer to select knowledge-intensive data, which can also be used for domain-specific high-knowledge data selection by restricting the knowledge elements to a specific domain. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves the model's performance on knowledge-intensive and general comprehension tasks and is effective in enhancing both the generic and domain-specific capabilities of the model.
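As a rough illustration of the selection idea described in the abstract, the sketch below scores each document with a density term (knowledge-element mentions per token) and a coverage term (fraction of the element pool that appears), then keeps the highest-scoring documents. The metric definitions, the string matching, the weighting parameter `alpha`, and the toy element pool are all assumptions made for illustration; they are not the paper's actual HKS formulation or knowledge element pool.

```python
def knowledge_score(text, element_pool, alpha=0.5):
    """Score a document by knowledge density and coverage.

    element_pool: dict mapping domain name -> set of knowledge-element strings.
    The density/coverage definitions and the weight ``alpha`` are illustrative
    assumptions, not the paper's exact HKS formulation.
    """
    lowered = text.lower()
    n_tokens = max(len(lowered.split()), 1)

    hits = 0          # total knowledge-element mentions in the document
    covered = set()   # distinct (domain, element) pairs found at least once
    for domain, elements in element_pool.items():
        for elem in elements:
            count = lowered.count(elem.lower())
            if count:
                hits += count
                covered.add((domain, elem))

    density = hits / n_tokens                                   # mentions per token
    pool_size = sum(len(e) for e in element_pool.values()) or 1
    coverage = len(covered) / pool_size                         # fraction of pool touched
    return alpha * density + (1 - alpha) * coverage


def select_high_knowledge(docs, element_pool, top_k=2):
    """Keep the top_k highest-scoring documents (illustrative selection step)."""
    ranked = sorted(docs, key=lambda d: knowledge_score(d, element_pool), reverse=True)
    return ranked[:top_k]


# Hypothetical element pool and corpus, for demonstration only.
pool = {
    "physics": {"quantum entanglement", "general relativity"},
    "biology": {"CRISPR", "mitochondria"},
}
docs = [
    "Quantum entanglement has no classical analogue; general relativity describes gravity.",
    "The weather was pleasant and everyone enjoyed the picnic.",
]
print(select_high_knowledge(docs, pool, top_k=1))
```

Restricting `element_pool` to a single domain would turn the same routine into the domain-specific high-knowledge selection the abstract mentions.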
Related papers
- Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning [83.99974309930072]
Domain-specific instruction-tuning has become the de facto standard for improving the performance of large language models. We propose a Knowledge-aware Data Selection (KDS) framework to select the domain-specific instruction-tuning data that meets LLMs' actual needs. By filtering out data with large knowledge conflicts and sampling high-quality and diverse data, KDS can effectively stimulate the LLMs' abilities and achieve better domain-specific performance.
arXiv Detail & Related papers (2025-05-28T04:18:24Z) - LEKA: LLM-Enhanced Knowledge Augmentation [24.552995956148145]
Humans excel at analogical learning and knowledge transfer. Ideally, models would transition from passively acquiring knowledge to actively accessing and learning from it. We develop LEKA, a knowledge augmentation method for knowledge transfer.
arXiv Detail & Related papers (2025-01-29T17:44:57Z) - KaLM: Knowledge-aligned Autoregressive Language Modeling via Dual-view Knowledge Graph Contrastive Learning [74.21524111840652]
This paper proposes KaLM, a Knowledge-aligned Language Modeling approach. It fine-tunes autoregressive large language models to align with knowledge graph (KG) knowledge via the joint objective of explicit knowledge alignment and implicit knowledge alignment. Notably, the method achieves a significant performance boost in evaluations of knowledge-driven tasks.
arXiv Detail & Related papers (2024-12-06T11:08:24Z) - KBAlign: Efficient Self Adaptation on Specific Knowledge Bases [73.34893326181046]
We present KBAlign, a self-supervised framework that enhances RAG systems through efficient model adaptation. Our key insight is to leverage the model's intrinsic capabilities for knowledge alignment through two innovative mechanisms. Experiments demonstrate that KBAlign can achieve 90% of the performance gain obtained through GPT-4-supervised adaptation.
arXiv Detail & Related papers (2024-11-22T08:21:03Z) - Large Language Models are Limited in Out-of-Context Knowledge Reasoning [65.72847298578071]
Large Language Models (LLMs) possess extensive knowledge and strong capabilities in performing in-context reasoning.
This paper focuses on a significant aspect of out-of-context reasoning: Out-of-Context Knowledge Reasoning (OCKR), which combines multiple pieces of knowledge to infer new knowledge.
arXiv Detail & Related papers (2024-06-11T15:58:59Z) - Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators [78.63553017938911]
Large language models (LLMs) outperform information retrieval techniques for downstream knowledge-intensive tasks.
However, community concerns abound regarding the factuality and potential implications of using this uncensored knowledge.
We introduce CONNER, designed to evaluate generated knowledge from six important perspectives.
arXiv Detail & Related papers (2023-10-11T08:22:37Z) - Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models [46.079902719883414]
We propose Knowledge Card, a modular framework for plugging new factual and relevant knowledge into general-purpose language models.
We first introduce knowledge cards -- specialized language models trained on corpora from specific domains and sources.
We then propose three content selectors to dynamically select and retain information in documents generated by knowledge cards.
arXiv Detail & Related papers (2023-05-17T05:25:27Z) - UNTER: A Unified Knowledge Interface for Enhancing Pre-trained Language Models [100.4659557650775]
We propose a UNified knowledge inTERface, UNTER, to provide a unified perspective to exploit both structured knowledge and unstructured knowledge.
With both forms of knowledge injected, UNTER gains continuous improvements on a series of knowledge-driven NLP tasks.
arXiv Detail & Related papers (2023-05-02T17:33:28Z) - LM-CORE: Language Models with Contextually Relevant External Knowledge [13.451001884972033]
We argue that storing large amounts of knowledge in the model parameters is sub-optimal given the ever-growing amounts of knowledge and resource requirements.
We present LM-CORE, a general framework to achieve this, which allows decoupling of language model training from the external knowledge source.
Experimental results show that LM-CORE, having access to external knowledge, achieves significant and robust outperformance over state-of-the-art knowledge-enhanced language models on knowledge probing tasks.
arXiv Detail & Related papers (2022-08-12T18:59:37Z) - Informed Learning by Wide Neural Networks: Convergence, Generalization and Sampling Complexity [27.84415856657607]
We study how and why domain knowledge benefits the performance of informed learning.
We propose a generalized informed training objective to better exploit the benefits of knowledge and balance the label and knowledge imperfectness.
arXiv Detail & Related papers (2022-07-02T06:28:25Z) - Sequential Latent Knowledge Selection for Knowledge-Grounded Dialogue [51.513276162736844]
We propose a sequential latent variable model as the first approach to this matter.
The model, named the sequential knowledge transformer (SKT), can keep track of the prior and posterior distributions over knowledge.
arXiv Detail & Related papers (2020-02-18T11:59:59Z)