Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base
- URL: http://arxiv.org/abs/2503.23361v1
- Date: Sun, 30 Mar 2025 08:33:56 GMT
- Title: Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base
- Authors: Linxin Song, Xuwei Ding, Jieyu Zhang, Taiwei Shi, Ryotaro Shimizu, Rahul Gupta, Yang Liu, Jian Kang, Jieyu Zhao
- Abstract summary: Large language models (LLMs) possess impressive linguistic capabilities but often fail to faithfully retain factual knowledge. We propose stochastic error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies (errors) in closed-weight LLMs. SEA uncovers 40.7x more knowledge errors than Automated Capability Discovery and 26.7% more than AutoBencher.
- Score: 30.705524808195268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) possess impressive linguistic capabilities but often fail to faithfully retain factual knowledge, leading to hallucinations and unreliable outputs. Understanding LLMs' knowledge deficiencies by exhaustively evaluating against full-scale knowledge bases is computationally prohibitive, especially for closed-weight models. We propose stochastic error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies (errors) in closed-weight LLMs under a strict query budget. Rather than naively probing all knowledge candidates, SEA formulates error discovery as a stochastic optimization process: it iteratively retrieves new high-error candidates by leveraging the semantic similarity to previously observed failures. To further enhance search efficiency and coverage, SEA employs hierarchical retrieval across document and paragraph levels, and constructs a relation directed acyclic graph to model error propagation and identify systematic failure modes. Empirically, SEA uncovers 40.7x more knowledge errors than Automated Capability Discovery and 26.7% more than AutoBencher, while reducing the cost-per-error by 599x and 9x, respectively. Human evaluation confirms the high quality of generated questions, while ablation and convergence analyses validate the contribution of each component in SEA. Further analysis on the discovered errors reveals correlated failure patterns across LLM families and recurring deficits, highlighting the need for better data coverage and targeted fine-tuning in future LLM development.
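The abstract describes SEA as an iterative, budget-constrained search: probe a batch of candidates, record failures, then retrieve new candidates that are semantically close to the observed failures. The sketch below is a minimal illustration of that loop under stated assumptions; `knowledge_base`, `embed`, `ask_model`, and `grade_answer` are hypothetical helpers, and the candidate-selection rule (rank by similarity to the centroid of past failures) follows only the high-level description in the abstract, not the paper's exact algorithm or its hierarchical retrieval and relation-DAG components.

```python
import numpy as np

def stochastic_error_ascent(knowledge_base, embed, ask_model, grade_answer,
                            query_budget=1000, batch_size=20, seed=0):
    """Minimal sketch of a budget-constrained error-discovery loop.

    knowledge_base: list of paragraph strings (knowledge candidates to probe).
    embed:          text -> 1-D numpy vector, used for semantic retrieval.
    ask_model:      paragraph -> (question, model_answer), a generated QA probe.
    grade_answer:   (paragraph, question, model_answer) -> True if the answer is wrong.
    """
    rng = np.random.default_rng(seed)
    vectors = np.stack([embed(p) for p in knowledge_base])
    vectors = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    error_idx, unseen, spent = [], set(range(len(knowledge_base))), 0

    while spent < query_budget and unseen:
        if error_idx:
            # Ascend: rank unseen candidates by similarity to the centroid of past failures.
            centroid = vectors[error_idx].mean(axis=0)
            batch = sorted(unseen, key=lambda i: -float(vectors[i] @ centroid))[:batch_size]
        else:
            # Cold start: probe a random batch before any errors are known.
            batch = rng.choice(list(unseen), size=min(batch_size, len(unseen)),
                               replace=False).tolist()
        for i in batch:
            unseen.discard(i)
            question, answer = ask_model(knowledge_base[i])
            spent += 1
            if grade_answer(knowledge_base[i], question, answer):
                error_idx.append(i)
            if spent >= query_budget:
                break
    return [knowledge_base[i] for i in error_idx]
```

The point of the similarity-driven batch selection is that each discovered error steers the remaining query budget toward nearby, likely-weak regions of the knowledge base instead of spending it uniformly.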
Related papers
- Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection.
Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities?
This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z) - ReLearn: Unlearning via Learning for Large Language Models [64.2802606302194]
We propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning.
This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation.
Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output.
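The ReLearn summary names two metrics, KFR and KRR, without defining them here. The snippet below is a hedged sketch of one plausible reading: KFR as the fraction of forget-set facts the unlearned model no longer reproduces, and KRR as the fraction of retain-set facts it still answers correctly. The paper's exact definitions may differ, and `answers_correctly` is a hypothetical judge.

```python
def knowledge_rates(model, forget_set, retain_set, answers_correctly):
    """Assumed metric definitions, not the paper's exact formulas.

    forget_set / retain_set: lists of (question, reference_answer) pairs.
    answers_correctly:       (model, question, reference_answer) -> bool.
    """
    forgotten = sum(not answers_correctly(model, q, a) for q, a in forget_set)
    retained = sum(answers_correctly(model, q, a) for q, a in retain_set)
    kfr = forgotten / len(forget_set)   # Knowledge Forgetting Rate
    krr = retained / len(retain_set)    # Knowledge Retention Rate
    return kfr, krr
```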
arXiv Detail & Related papers (2025-02-16T16:31:00Z) - Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory [15.986679553468989]
Large language models (LLMs) have shown promise as potential knowledge bases. LLMs often struggle with question-answering tasks and are prone to hallucinations. We develop SkipUnsure, a method to improve answer accuracy by leveraging detected but unexpressed knowledge.
arXiv Detail & Related papers (2024-12-30T10:29:18Z) - Epistemic Integrity in Large Language Models [11.173637560124828]
Large language models are increasingly relied upon as sources of information, but their propensity for false or misleading statements poses high risks for users and society.
In this paper, we confront the critical problem of miscalibration where a model's linguistic assertiveness fails to reflect its true internal certainty.
We introduce a new human misalignment evaluation and a novel method for measuring the linguistic assertiveness of Large Language Models.
arXiv Detail & Related papers (2024-11-10T17:10:13Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - To Err is Machine: Vulnerability Detection Challenges LLM Reasoning [8.602355712876815]
We present a challenging code reasoning task: vulnerability detection. State-of-the-art (SOTA) models reported only 54.5% Balanced Accuracy in our vulnerability detection evaluation. New models, new training methods, or more execution-specific pretraining data may be needed to conquer vulnerability detection.
arXiv Detail & Related papers (2024-03-25T21:47:36Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Discovery of the Hidden World with Large Language Models [95.58823685009727]
This paper presents Causal representatiOn AssistanT (COAT) that introduces large language models (LLMs) to bridge the gap.
LLMs are trained on massive observations of the world and have demonstrated great capability in extracting key information from unstructured data.
COAT also adopts causal discovery (CD) methods to find causal relations among the identified variables, as well as to provide feedback to LLMs to iteratively refine the proposed factors.
arXiv Detail & Related papers (2024-02-06T12:18:54Z) - Knowledge Verification to Nip Hallucination in the Bud [69.79051730580014]
We demonstrate the feasibility of mitigating hallucinations by verifying and minimizing the inconsistency between external knowledge present in the alignment data and the intrinsic knowledge embedded within foundation LLMs.
We propose a novel approach called Knowledge Consistent Alignment (KCA), which employs a well-aligned LLM to automatically formulate assessments based on external knowledge.
We demonstrate the superior efficacy of KCA in reducing hallucinations across six benchmarks, utilizing foundation LLMs of varying backbones and scales.
arXiv Detail & Related papers (2024-01-19T15:39:49Z) - "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure relevance assessment using (i) hallucination rate, the model's tendency to hallucinate an answer when no answer is present in the passages of the non-relevant subset, and (ii) error rate, the model's failure to recognize relevant passages in the relevant subset.
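Read this way, both NoMIRACL metrics are simple rates over the two evaluation subsets. A minimal sketch under that reading follows; the paper's exact scoring protocol may differ, and `model_claims_answer` is a hypothetical predicate for whether the model attempts an answer rather than abstaining.

```python
def nomiracl_rates(model_claims_answer, non_relevant_subset, relevant_subset):
    """Assumed reading of the metrics described in the abstract summary.

    non_relevant_subset: queries whose retrieved passages contain no answer;
                         answering anyway is counted as a hallucination.
    relevant_subset:     queries whose passages do contain the answer;
                         abstaining here is counted as an error.
    """
    hallucinations = sum(model_claims_answer(q) for q in non_relevant_subset)
    misses = sum(not model_claims_answer(q) for q in relevant_subset)
    hallucination_rate = hallucinations / len(non_relevant_subset)
    error_rate = misses / len(relevant_subset)
    return hallucination_rate, error_rate
```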
arXiv Detail & Related papers (2023-12-18T17:18:04Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
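The three-step recipe summarized above (generate clarifications of the input, query the LLM on each, ensemble the predictions) maps directly onto a short loop. The sketch below assumes hypothetical `generate_clarifications` and `llm_predict` helpers and uses a simple majority vote as the ensembling step; the paper's actual aggregation and uncertainty decomposition are more involved.

```python
from collections import Counter

def clarification_ensemble(prompt, generate_clarifications, llm_predict,
                           n_clarifications=5):
    """Sketch of input clarification ensembling as summarized above.

    generate_clarifications: (prompt, n) -> list of disambiguated prompts.
    llm_predict:             clarified prompt -> predicted answer string.
    """
    clarified = generate_clarifications(prompt, n_clarifications)
    predictions = [llm_predict(c) for c in clarified]
    votes = Counter(predictions)
    answer, count = votes.most_common(1)[0]
    # Disagreement across clarified inputs signals uncertainty that stems from
    # ambiguity in the input rather than from the model itself.
    confidence = count / len(predictions)
    return answer, confidence
```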
arXiv Detail & Related papers (2023-11-15T05:58:35Z) - Pareto Optimal Learning for Estimating Large Language Model Errors [12.21899680905672]
Large Language Models (LLMs) have shown impressive abilities in many applications.
We present a method that generates a risk score to estimate the probability of error in an LLM response by integrating multiple sources of information.
arXiv Detail & Related papers (2023-06-28T21:11:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.