Scaling Truth: The Confidence Paradox in AI Fact-Checking
- URL: http://arxiv.org/abs/2509.08803v1
- Date: Wed, 10 Sep 2025 17:36:25 GMT
- Title: Scaling Truth: The Confidence Paradox in AI Fact-Checking
- Authors: Ihsan A. Qazi, Zohaib Khan, Abdullah Ghani, Agha A. Raza, Zafar A. Qazi, Wassay Sajjad, Ayesha Ali, Asher Javaid, Muhammad Abdullah Sohail, Abdul H. Azeemi,
- Abstract summary: Large language models (LLMs) hold promise in automating fact verification, yet their effectiveness across global contexts remains uncertain. We systematically evaluate nine established LLMs across multiple categories using 5,000 claims previously assessed by 174 professional fact-checking organizations across 47 languages. Findings reveal a concerning pattern resembling the Dunning-Kruger effect: smaller, accessible models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence.
- Score: 0.8201655885319955
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The rise of misinformation underscores the need for scalable and reliable fact-checking solutions. Large language models (LLMs) hold promise in automating fact verification, yet their effectiveness across global contexts remains uncertain. We systematically evaluate nine established LLMs across multiple categories (open/closed-source, multiple sizes, diverse architectures, reasoning-based) using 5,000 claims previously assessed by 174 professional fact-checking organizations across 47 languages. Our methodology tests model generalizability on claims postdating training cutoffs and four prompting strategies mirroring both citizen and professional fact-checker interactions, with over 240,000 human annotations as ground truth. Findings reveal a concerning pattern resembling the Dunning-Kruger effect: smaller, accessible models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence. This risks systemic bias in information verification, as resource-constrained organizations typically use smaller models. Performance gaps are most pronounced for non-English languages and claims originating from the Global South, threatening to widen existing information inequalities. These results establish a multilingual benchmark for future research and provide an evidence base for policy aimed at ensuring equitable access to trustworthy, AI-assisted fact-checking.
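A minimal sketch of how the accuracy-confidence comparison described above could be tabulated from claim-level results. This is not the authors' released code; the record fields `model`, `verdict`, `gold`, and `confidence` are hypothetical placeholders for how the annotated outcomes might be stored.

```python
# Hypothetical sketch (not the authors' code) of tabulating per-model accuracy
# against self-reported confidence to surface the "confidence paradox".
from collections import defaultdict

def accuracy_vs_confidence(records):
    """Return per-model accuracy, mean stated confidence, and their gap."""
    stats = defaultdict(lambda: {"correct": 0, "total": 0, "conf_sum": 0.0})
    for r in records:
        s = stats[r["model"]]
        s["total"] += 1
        s["correct"] += int(r["verdict"] == r["gold"])
        s["conf_sum"] += r["confidence"]  # self-reported confidence in [0, 1]
    return {
        model: {
            "accuracy": s["correct"] / s["total"],
            "mean_confidence": s["conf_sum"] / s["total"],
            "overconfidence": s["conf_sum"] / s["total"] - s["correct"] / s["total"],
        }
        for model, s in stats.items()
    }

if __name__ == "__main__":
    toy = [
        {"model": "small-open", "verdict": "false", "gold": "true", "confidence": 0.95},
        {"model": "small-open", "verdict": "true", "gold": "true", "confidence": 0.90},
        {"model": "large-closed", "verdict": "true", "gold": "true", "confidence": 0.60},
        {"model": "large-closed", "verdict": "false", "gold": "false", "confidence": 0.55},
    ]
    for model, row in accuracy_vs_confidence(toy).items():
        print(model, row)
```

On these toy records the smaller model shows a positive overconfidence gap and the larger model a negative one, illustrating the Dunning-Kruger-like pattern the abstract describes.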
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z)
- Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety [1.4517170578045737]
Critical meaning errors can undermine the reliability, fairness, and safety of multilingual systems. In this work, we explore the capacity of instruction-tuned Large Language Models (LLMs) to detect such errors. Our findings show that model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements.
arXiv Detail & Related papers (2026-02-11T23:47:39Z)
- Can Large Language Models Express Uncertainty Like Human? [71.27418419522884]
We release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores. We conduct the first systematic study of linguistic confidence across modern large language models.
arXiv Detail & Related papers (2025-09-29T02:34:30Z)
- Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability [1.1135113962297134]
This paper introduces a novel, dynamic data set that includes 61,514 claims in multiple languages and topics, extending existing datasets up to 2024. We evaluate five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability.
arXiv Detail & Related papers (2025-06-04T07:47:21Z)
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [0.0]
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents. Recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses. This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation.
arXiv Detail & Related papers (2025-04-10T16:00:59Z)
- Epistemic Integrity in Large Language Models [10.50127599111102]
Large language models are increasingly relied upon as sources of information, but their propensity for false or misleading statements poses high risks for users and society. In this paper, we confront the critical problem of miscalibration, where a model's linguistic assertiveness fails to reflect its true internal certainty. We introduce a new human misalignment evaluation and a novel method for measuring the linguistic assertiveness of Large Language Models.
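A toy sketch of the assertiveness-versus-certainty gap this paper studies. The lexicon-based score and the `internal_certainty` argument are hypothetical stand-ins, not the paper's learned measure, used only to make the idea of miscalibration concrete.

```python
# Illustrative only: compare how confident a response *sounds* with how certain
# the model actually is. HEDGES/BOOSTERS and internal_certainty are assumptions.
HEDGES = {"might", "may", "possibly", "perhaps", "i think", "not sure"}
BOOSTERS = {"definitely", "certainly", "clearly", "undoubtedly", "without doubt"}

def assertiveness(text: str) -> float:
    """Crude linguistic assertiveness in [0, 1]: boosters raise it, hedges lower it."""
    t = text.lower()
    score = 0.5 + 0.2 * sum(t.count(b) for b in BOOSTERS) - 0.2 * sum(t.count(h) for h in HEDGES)
    return max(0.0, min(1.0, score))

def miscalibration(text: str, internal_certainty: float) -> float:
    """Positive when the model sounds more confident than it actually is."""
    return assertiveness(text) - internal_certainty

print(miscalibration("This claim is definitely false.", internal_certainty=0.55))  # sounds too sure
print(miscalibration("I think this might be false.", internal_certainty=0.90))     # sounds too hedged
```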
arXiv Detail & Related papers (2024-11-10T17:10:13Z)
- FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework. Our study reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness.
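A minimal sketch of the kind of faithfulness probe such benchmarks run: supply a (possibly counterfactual) context and check whether the model answers from it rather than from its parametric knowledge. `ask_llm` is a hypothetical placeholder for whichever model API is being evaluated.

```python
# Hypothetical faithfulness probe; ask_llm stands in for the model under test.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in the model under evaluation")

def is_faithful(context: str, question: str, context_answer: str) -> bool:
    """True if the model answers from the supplied context, even when counterfactual."""
    prompt = (
        "Answer strictly from the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return context_answer.lower() in ask_llm(prompt).lower()

# A faithful model should answer "marshmallows" here, not its world knowledge:
# is_faithful("The Moon is made of marshmallows.", "What is the Moon made of?", "marshmallows")
```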
arXiv Detail & Related papers (2024-09-30T06:27:53Z)
- Multimodal Large Language Models to Support Real-World Fact-Checking [80.41047725487645]
Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information.
While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied.
We propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking.
arXiv Detail & Related papers (2024-03-06T11:32:41Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
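As a rough, hypothetical illustration of the attack-then-score loop behind such benchmarks: AdvGLUE itself applies 14 curated attack methods with human validation, not this toy typo swap, and `classify` is a placeholder for the model under test.

```python
import random

def perturb(sentence: str, rate: float = 0.1, seed: int = 0) -> str:
    """Toy character-swap 'attack': randomly transpose adjacent letters."""
    rng = random.Random(seed)
    chars = list(sentence)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def adversarial_accuracy(classify, labeled_sentences):
    """Accuracy of `classify` on perturbed inputs versus the benign labels."""
    correct = sum(classify(perturb(s)) == label for s, label in labeled_sentences)
    return correct / len(labeled_sentences)
```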
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.