Large Language Models Require Curated Context for Reliable Political Fact-Checking -- Even with Reasoning and Web Search
- URL: http://arxiv.org/abs/2511.18749v1
- Date: Mon, 24 Nov 2025 04:22:32 GMT
- Title: Large Language Models Require Curated Context for Reliable Political Fact-Checking -- Even with Reasoning and Web Search
- Authors: Matthew R. DeVerna, Kai-Cheng Yang, Harry Yaojun Yan, Filippo Menczer
- Abstract summary: We evaluate 15 recent large language models (LLMs) on more than 6,000 claims fact-checked by PolitiFact. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains. A curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants.
- Score: 3.282845873351502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools -- and millions of users already rely on them for verification -- rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning-enabled and web-search-enabled variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated, high-quality context is a promising path for automated fact-checking.
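To make the curated-context setup concrete, here is a minimal Python sketch of the pattern the abstract describes: retrieve a fact-check summary for a claim, prepend it to the prompt as context, and score predictions with macro F1. The corpus, the toy word-overlap retriever, the `ask_llm` callable, and the exact label strings are hypothetical stand-ins, not the authors' pipeline; only the overall shape follows the abstract.

```python
# Minimal sketch of curated-context fact-checking, assuming a generic
# text-in/text-out LLM callable. The corpus, retriever, and labels are
# illustrative stand-ins for the paper's PolitiFact-based pipeline.
from sklearn.metrics import f1_score

# Hypothetical curated corpus: (claim, fact-check summary, verdict).
CORPUS = [
    ("The unemployment rate doubled last year.",
     "Official statistics show the rate fell from 4.1% to 3.9%.", "false"),
    ("The bill cuts the program's budget by 10%.",
     "The enacted text reduces that budget line by 10% from prior levels.", "true"),
]

def retrieve(claim: str) -> str:
    """Toy retriever: return the summary whose claim shares the most words."""
    words = set(claim.lower().split())
    _, summary, _ = max(
        CORPUS, key=lambda row: len(words & set(row[0].lower().split()))
    )
    return summary

def fact_check(claim: str, ask_llm) -> str:
    """Prepend curated context to the prompt; `ask_llm` maps str -> str."""
    prompt = (
        "You are a fact-checker. Using only the context below, label the claim "
        "with one of PolitiFact's ratings (true, mostly-true, half-true, "
        "mostly-false, false, pants-on-fire).\n"
        f"Context: {retrieve(claim)}\nClaim: {claim}\nLabel:"
    )
    return ask_llm(prompt).strip().lower()

# The paper's headline metric: macro F1 averages per-class F1 scores,
# so rare verdict classes count as much as common ones.
def evaluate(y_true: list[str], y_pred: list[str]) -> float:
    return f1_score(y_true, y_pred, average="macro")
```

Macro averaging weighs each verdict class equally, so a model that collapses onto a single common label scores poorly, making it a stricter test than plain accuracy on imbalanced labels.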
Related papers
- MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models [29.830224745428566]
We present MMErroR, a benchmark of 2,013 samples, each embedding a single coherent reasoning error. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation. We evaluate 20 advanced Vision-Language Models; even the best model (Gemini-3.0-Pro) correctly classifies the error in only 66.47% of cases.
arXiv Detail & Related papers (2026-01-06T17:45:26Z)
- ClaimCheck: Real-Time Fact-Checking with Small Language Models [5.305110876082343]
ClaimCheck is an LLM-guided automatic fact-checking system designed to verify real-world claims. Unlike prior systems that rely on large, closed-source models, ClaimCheck employs a transparent, stepwise verification pipeline. Each module is optimized for small LLMs, allowing the system to deliver accurate and interpretable fact-checking.
arXiv Detail & Related papers (2025-09-22T21:18:08Z)
- Scaling Truth: The Confidence Paradox in AI Fact-Checking [0.8201655885319955]
Large language models (LLMs) hold promise for automating fact verification, yet their effectiveness across global contexts remains uncertain. We systematically evaluate nine established LLMs across multiple categories using 5,000 claims previously assessed by 174 professional fact-checking organizations across 47 languages. Findings reveal a concerning pattern resembling the Dunning-Kruger effect: smaller models show high confidence despite lower accuracy, while larger models demonstrate higher accuracy but lower confidence.
arXiv Detail & Related papers (2025-09-10T17:36:25Z)
- A Generative-AI-Driven Claim Retrieval System Capable of Detecting and Retrieving Claims from Social Media Platforms in Multiple Languages [1.3331869040581863]
This research introduces an approach that retrieves previously fact-checked claims, evaluates their relevance to a given input, and provides supplementary information to support fact-checkers. Our method employs large language models (LLMs) to filter irrelevant fact-checks and generate concise summaries and explanations. Our results demonstrate that LLMs are able to filter out many irrelevant fact-checks and, therefore, reduce effort and streamline the fact-checking process.
arXiv Detail & Related papers (2025-04-29T11:49:05Z)
- Large Language Models Are Better Logical Fallacy Reasoners with Counterargument, Explanation, and Goal-Aware Prompt Formulation [2.4073494101588273]
This study presents a novel and effective prompt formulation approach for logical fallacy detection. Our method enriches the input text by incorporating implicit contextual information, which we query for validity within the context of the argument. We evaluate our approach across multiple datasets from 5 domains, covering 29 distinct fallacy types.
arXiv Detail & Related papers (2025-03-30T08:41:09Z)
- One Thousand and One Pairs: A "novel" challenge for long-context language models [56.60667988954638]
NoCha is a dataset of 1,001 pairs of true and false claims about 67 fictional books.
Our annotators confirm that the largest share of pairs in NoCha requires global reasoning over the entire book to verify.
On average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning.
arXiv Detail & Related papers (2024-06-24T02:03:57Z)
- Long-form factuality in large language models [60.07181269469043]
Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics.
We benchmark a model's long-form factuality in open domains, using GPT-4 to generate LongFact.
We then propose that LLM agents can be used as automated evaluators of long-form factuality through a method we call the Search-Augmented Factuality Evaluator (SAFE); a sketch of this search-and-judge pattern follows the related-papers list below.
arXiv Detail & Related papers (2024-03-27T17:48:55Z)
- Multimodal Large Language Models to Support Real-World Fact-Checking [80.41047725487645]
Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information.
While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied.
We propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking.
arXiv Detail & Related papers (2024-03-06T11:32:41Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) have seen widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Multilingual and Multi-topical Benchmark of Fine-tuned Language models and Large Language Models for Check-Worthy Claim Detection [1.4779899760345434]
This study compares the performance of (1) fine-tuned language models and (2) large language models on the task of check-worthy claim detection.
We composed a multilingual and multi-topical dataset comprising texts of various sources and styles.
arXiv Detail & Related papers (2023-11-10T15:36:35Z)
- Mismatched No More: Joint Model-Policy Optimization for Model-Based RL [172.37829823752364]
We propose a single objective for jointly training the model and the policy, such that updates to either component increase a lower bound on expected return.
Our objective is a global lower bound on expected return, and this bound becomes tight under certain assumptions.
The resulting algorithm (MnM) is conceptually similar to a GAN.
arXiv Detail & Related papers (2021-10-06T13:43:27Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
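As referenced in the long-form factuality entry above, here is a hedged Python sketch of the search-augmented evaluation pattern that SAFE exemplifies: split a long response into atomic claims, gather search evidence for each, and let an LLM judge rate support. The naive sentence splitter and the `web_search` / `ask_llm` stubs are hypothetical simplifications, not the cited implementation.

```python
# Hedged sketch of search-augmented factuality evaluation in the spirit
# of SAFE. `web_search` and `ask_llm` are assumed str -> str callables;
# the sentence splitter stands in for LLM-based claim decomposition.
from typing import Callable

def split_into_claims(response: str) -> list[str]:
    """Naive stand-in: treat each sentence as one atomic claim."""
    return [s.strip() for s in response.split(".") if s.strip()]

def claim_supported(claim: str,
                    web_search: Callable[[str], str],
                    ask_llm: Callable[[str], str]) -> bool:
    """Ask the LLM judge whether the search evidence supports the claim."""
    evidence = web_search(claim)  # assumed: returns text snippets for a query
    prompt = (
        f"Evidence:\n{evidence}\n\nClaim: {claim}\n"
        "Answer 'supported' or 'unsupported', based only on the evidence."
    )
    return ask_llm(prompt).strip().lower().startswith("supported")

def factuality_score(response: str, web_search, ask_llm) -> float:
    """Fraction of extracted claims that the evidence supports."""
    claims = split_into_claims(response)
    if not claims:
        return 1.0
    return sum(claim_supported(c, web_search, ask_llm)
               for c in claims) / len(claims)
```

In the cited method the decomposition and rating steps are themselves performed by an LLM; the stubs here only show the control flow.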
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.