Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models
- URL: http://arxiv.org/abs/2509.23233v1
- Date: Sat, 27 Sep 2025 10:32:41 GMT
- Title: Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models
- Authors: Sina J. Semnani, Jirayu Burapacheep, Arpandeep Khatua, Thanawan Atchariyachanvanit, Zheng Wang, Monica S. Lam
- Abstract summary: We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time.
- Score: 11.16952630564181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia, and how can we improve it? We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time. Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies propagating into 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Benchmarking strong baselines on this dataset reveals substantial headroom: the best fully automated system achieves an AUROC of only 75.1%. Our results show that contradictions are a measurable component of Wikipedia and that LLM-based systems like CLAIRE can provide a practical tool to help editors improve knowledge consistency at scale.
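The corpus-level inconsistency detection task introduced in the abstract can be illustrated with a minimal sketch. CLAIRE itself combines LLM reasoning with retrieval; the toy version below only captures the structure of the task, assuming facts have already been extracted into hypothetical (article, subject, relation, value) tuples and flagging groups whose values disagree. The function name and data layout are illustrative assumptions, not the paper's actual pipeline.

```python
from collections import defaultdict

def find_inconsistencies(facts):
    """Toy corpus-level inconsistency detector: group atomic
    (subject, relation) pairs across articles and flag any group
    whose values disagree."""
    groups = defaultdict(set)    # (subject, relation) -> set of values
    sources = defaultdict(list)  # (subject, relation) -> [(article, value)]
    for article, subj, rel, val in facts:
        groups[(subj, rel)].add(val)
        sources[(subj, rel)].append((article, val))
    # Keep only groups where more than one distinct value was asserted.
    return {key: sources[key] for key, vals in groups.items() if len(vals) > 1}

facts = [
    ("Article_A", "Ada Lovelace", "birth_year", "1815"),
    ("Article_B", "Ada Lovelace", "birth_year", "1816"),  # contradicts Article_A
    ("Article_A", "Ada Lovelace", "occupation", "mathematician"),
    ("Article_C", "Ada Lovelace", "occupation", "mathematician"),
]
conflicts = find_inconsistencies(facts)
```

In practice the hard parts, which this sketch omits, are extracting comparable claims from free text and judging contradiction with an LLM rather than exact value mismatch.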
Related papers
- FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding [81.2130536158575]
While LLMs exhibit remarkable fluency, their utility is often compromised by factual hallucinations and a lack of traceable provenance. We introduce FactNet, a massive, open-source resource designed to unify 1.7 billion atomic assertions with 3.01 billion auditable evidence pointers derived exclusively from 316 Wikipedia editions.
arXiv Detail & Related papers (2026-02-03T11:44:11Z) - Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles [56.724847946825285]
We introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert-level references. We propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability.
arXiv Detail & Related papers (2026-02-02T03:30:13Z) - Factual Inconsistencies in Multilingual Wikipedia Tables [5.395647076142643]
This study investigates cross-lingual inconsistencies in Wikipedia's structured content. We develop a methodology to collect, align, and analyze tables from Wikipedia multilingual articles. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems.
arXiv Detail & Related papers (2025-07-24T13:46:14Z) - Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection [48.188285483378664]
We introduce a novel, real-world, large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. We propose WikiDYK, which leverages recently added, human-written facts from Wikipedia's "Did You Know..." entries. WikiDYK contains 12,290 facts and 77,180 questions, and integrates seamlessly with future updates from Wikipedia editors.
arXiv Detail & Related papers (2025-05-18T08:39:05Z) - Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation [16.506990103937515]
We stress-test a range of automatic factuality metrics to probe what they actually capture, and find that all metrics show substantial performance drops. Some metrics are more sensitive to benign, fact-preserving edits than to factual corrections.
arXiv Detail & Related papers (2024-11-25T18:15:15Z) - What Really is Commonsense Knowledge? [58.5342212738895]
We survey existing definitions of commonsense knowledge, ground them in three frameworks for defining concepts, and consolidate them into a unified definition of commonsense knowledge.
We then use the consolidated definition for annotations and experiments on the CommonsenseQA and CommonsenseQA 2.0 datasets.
Our study shows that a large portion of instances in the two datasets are not commonsense knowledge, and that models exhibit a large performance gap between these two subsets.
arXiv Detail & Related papers (2024-11-06T14:54:19Z) - WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs).
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z) - The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z) - A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia [57.31074448586854]
Large language models (LLMs) have an impressive ability to draw on novel information supplied in their context.
Yet the mechanisms underlying this contextual grounding remain unknown.
We present a novel method to study grounding abilities using Fakepedia.
arXiv Detail & Related papers (2023-12-04T17:35:42Z) - WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia [14.325320851640084]
We propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia.
Each sentence is extracted from the entire revision history of English Wikipedia.
WikiSQE has about 3.4M sentences with 153 quality labels.
arXiv Detail & Related papers (2023-05-10T06:45:13Z) - Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements [135.09277663808322]
We introduce Vera, a general-purpose model that estimates the plausibility of declarative statements based on commonsense knowledge.
Vera is trained on 7M commonsense statements created from 19 QA datasets and two large-scale knowledge bases.
We find that Vera excels at filtering LM-generated commonsense knowledge and is useful in detecting erroneous commonsense statements generated by models like ChatGPT in real-world settings.
arXiv Detail & Related papers (2023-05-05T17:15:32Z) - Longitudinal Assessment of Reference Quality on Wikipedia [7.823541290904653]
This work analyzes the reliability of this global encyclopedia through the lens of its references.
We operationalize the notion of reference quality by defining reference need (RN), i.e., the percentage of sentences missing a citation, and reference risk (RR), i.e., the proportion of non-authoritative references.
arXiv Detail & Related papers (2023-03-09T13:04:14Z) - ComFact: A Benchmark for Linking Contextual Commonsense Knowledge [31.19689856957576]
We propose the new task of commonsense fact linking, where models are given contexts and trained to identify situationally-relevant commonsense knowledge from KGs.
Our novel benchmark, ComFact, contains 293k in-context relevance annotations for commonsense across four stylistically diverse datasets.
arXiv Detail & Related papers (2022-10-23T09:30:39Z)
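The reference need (RN) and reference risk (RR) metrics defined in the Longitudinal Assessment entry above are simple ratios, and can be sketched directly from their definitions. The data layout (a list of per-sentence citation flags, a set of authoritative domains) is an illustrative assumption, not that paper's actual operationalization.

```python
def reference_need(sentences):
    # RN: percentage of sentences missing a citation.
    return sum(1 for s in sentences if not s["has_citation"]) / len(sentences)

def reference_risk(references, authoritative):
    # RR: proportion of references not on an authoritative-source list.
    return sum(1 for r in references if r not in authoritative) / len(references)

# Hypothetical article: 4 sentences, 1 uncited; 3 references, 1 non-authoritative.
sentences = [{"has_citation": True}] * 3 + [{"has_citation": False}]
references = ["bbc.com", "nytimes.com", "myblog.example"]
rn = reference_need(sentences)                               # 1 of 4
rr = reference_risk(references, {"bbc.com", "nytimes.com"})  # 1 of 3
```

A real deployment would also need to decide which sentences require a citation at all and how to classify a source as authoritative, both of which are where the actual modeling effort lies.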
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.