Towards Automated Fact-Checking of Real-World Claims: Exploring Task Formulation and Assessment with LLMs
- URL: http://arxiv.org/abs/2502.08909v1
- Date: Thu, 13 Feb 2025 02:51:17 GMT
- Title: Towards Automated Fact-Checking of Real-World Claims: Exploring Task Formulation and Assessment with LLMs
- Authors: Premtim Sahitaj, Iffat Maab, Junichi Yamagishi, Jawan Kolanowski, Sebastian Möller, Vera Schmitt,
- Abstract summary: This study establishes baseline comparisons for Automated Fact-Checking (AFC) using Large Language Models (LLMs)
We evaluate Llama-3 models of varying sizes on 17,856 claims collected from PolitiFact (2007-2024) using evidence retrieved via restricted web searches.
Our results show that larger LLMs consistently outperform smaller LLMs in classification accuracy and justification quality without fine-tuning.
- Score: 32.45604456988931
- License:
- Abstract: Fact-checking is necessary to address the increasing volume of misinformation. Traditional fact-checking relies on manual analysis to verify claims, but it is slow and resource-intensive. This study establishes baseline comparisons for Automated Fact-Checking (AFC) using Large Language Models (LLMs) across multiple labeling schemes (binary, three-class, five-class) and extends traditional claim verification by incorporating analysis, verdict classification, and explanation in a structured setup to provide comprehensive justifications for real-world claims. We evaluate Llama-3 models of varying sizes (3B, 8B, 70B) on 17,856 claims collected from PolitiFact (2007-2024) using evidence retrieved via restricted web searches. We utilize TIGERScore as a reference-free evaluation metric to score the justifications. Our results show that larger LLMs consistently outperform smaller LLMs in classification accuracy and justification quality without fine-tuning. We find that smaller LLMs in a one-shot scenario provide comparable task performance to fine-tuned Small Language Models (SLMs) with large context sizes, while larger LLMs consistently surpass them. Evidence integration improves performance across all models, with larger LLMs benefiting most. Distinguishing between nuanced labels remains challenging, emphasizing the need for further exploration of labeling schemes and alignment with evidences. Our findings demonstrate the potential of retrieval-augmented AFC with LLMs.
Related papers
- Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon [11.753349115726952]
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues.
We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that distorts benchmark prompts.
By rephrasing inputs while preserving semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns.
arXiv Detail & Related papers (2025-02-11T10:43:36Z) - Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods.
In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z) - Evaluating Small Language Models for News Summarization: Implications and Factors Influencing Performance [31.38160018745285]
Small language models (SLMs) present a more accessible alternative to large language models (LLMs)
This paper presents a comprehensive evaluation of 19 SLMs for news summarization across 2,000 news samples.
arXiv Detail & Related papers (2025-02-02T03:07:45Z) - IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark [22.238377215355545]
We introduce IdentifyMe, a new benchmark for mention resolution presented in a multiple-choice question (MCQ) format.
We observe a significant performance gap between the state-of-the-art sub-10B open models vs. closed ones.
The highest-scoring model, GPT-4o, achieves 81.9% accuracy, highlighting the strong referential capabilities of state-of-the-art LLMs.
arXiv Detail & Related papers (2024-11-12T01:05:55Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation [76.31621715032558]
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses.
We introduce CaLM, a novel verification framework.
Our framework empowers smaller LMs, which rely less on parametric memory, to validate the output of larger LMs.
arXiv Detail & Related papers (2024-06-08T06:04:55Z) - CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks [15.60762281287532]
Large Language Models (LLMs) are revolutionizing various domains, yet verifying their answers remains a significant challenge.
In this work, we propose CheckEmbed: an accurate, scalable, and simple LLM verification approach.
CheckEmbed is driven by a straightforward yet powerful idea: compare their corresponding answer-level embeddings obtained with a model such as GPT Text Embedding Large.
arXiv Detail & Related papers (2024-06-04T17:42:21Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS)
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z) - Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.