Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?
- URL: http://arxiv.org/abs/2511.21401v1
- Date: Wed, 26 Nov 2025 13:51:59 GMT
- Title: Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?
- Authors: Antonín Jarolím, Martin Fajčík, Lucia Makaiová
- Abstract summary: This paper focuses on fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence produced by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task -- fine-grained evidence extraction for Czech and Slovak claims. We create a new dataset containing two-way annotated fine-grained evidence produced by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.
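The validity criterion the abstract describes (evidence must be copied verbatim from the source article) is straightforward to operationalize. Below is a minimal sketch, assuming the model's output has already been parsed into a list of candidate evidence spans; the function name `validate_evidence_spans` and the whitespace-trimming choice are illustrative assumptions, not code from the paper.

```python
# Minimal sketch (not the paper's implementation): flag LLM-extracted
# evidence spans that are not exact substrings of the source document,
# the invalid-output failure mode the abstract reports.

def validate_evidence_spans(source_text: str, extracted_spans: list[str]) -> dict:
    """Classify each extracted span as verbatim (valid) or not (invalid)."""
    valid, invalid = [], []
    for span in extracted_spans:
        # A span counts as valid only if it can be copied character-for-
        # character from the source article (after trimming edge whitespace).
        if span.strip() and span.strip() in source_text:
            valid.append(span)
        else:
            invalid.append(span)
    total = len(extracted_spans)
    return {
        "valid": valid,
        "invalid": invalid,
        # Proportion of non-verbatim outputs, analogous to an error rate.
        "error_rate": len(invalid) / total if total else 0.0,
    }

# Example: the second span paraphrases rather than copies, so it is flagged.
article = "The mayor said the bridge will reopen in June after repairs."
spans = ["the bridge will reopen in June", "bridge reopens this summer"]
print(validate_evidence_spans(article, spans))
# {'valid': ['the bridge will reopen in June'],
#  'invalid': ['bridge reopens this summer'], 'error_rate': 0.5}
```

Exact substring matching is the strictest possible criterion; a real evaluation might first normalize whitespace or casing before matching, which would reclassify near-verbatim outputs.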
Related papers
- Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries [2.853035319109148]
Large language models (LLMs) are increasingly used in newsrooms. Their tendency to hallucinate poses risks to core journalistic practices of sourcing, attribution, and accuracy. We evaluate three widely used tools: ChatGPT, Gemini, and NotebookLM.
arXiv Detail & Related papers (2025-09-29T20:55:43Z) - (Fact) Check Your Bias [0.0]
We investigate how parametric knowledge biases affect fact-checking outcomes of the HerO system (baseline for FEVER-25). When prompted directly to perform fact-verification, Llama 3.1 labels nearly half the claims as "Not Enough Evidence". In the second experiment, we prompt the model to generate supporting, refuting, or neutral fact-checking documents. These prompts significantly influence retrieval outcomes, with approximately 50% of retrieved evidence being unique to each perspective. Despite differences in retrieved evidence, final verdict predictions show stability across prompting strategies.
arXiv Detail & Related papers (2025-06-26T20:03:58Z) - Improving the fact-checking performance of language models by relying on their entailment ability [3.371541812350348]
We propose a simple yet effective strategy to train encoder-only language models (ELMs) for fact-checking. We conducted a rigorous set of experiments, comparing our approach with recent works and various prompting and fine-tuning strategies to demonstrate the superiority of our approach.
arXiv Detail & Related papers (2025-05-21T03:15:06Z) - Document Attribution: Examining Citation Relationships using Large Language Models [62.46146670035751]
We propose a zero-shot approach that frames attribution as a straightforward textual entailment task. We also explore the role of the attention mechanism in enhancing the attribution process.
arXiv Detail & Related papers (2025-05-09T04:40:11Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs). We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - Self-Adaptive Paraphrasing and Preference Learning for Improved Claim Verifiability [9.088303226909277]
In fact-checking, the structure and phrasing of claims critically influence a model's ability to predict verdicts accurately. We propose a self-adaptive approach to claim extraction that does not rely on labeled training data. We show that this novel setup extracts claim paraphrases that are more verifiable than their original social media formulations.
arXiv Detail & Related papers (2024-12-16T10:54:57Z) - Attribute or Abstain: Large Language Models as Long Document Assistants [58.32043134560244]
LLMs can help humans working with long documents, but are known to hallucinate.
Existing approaches to attribution have only been evaluated in RAG settings, where the initial retrieval confounds LLM performance.
This is crucially different from the long document setting, where retrieval is not needed, but could help.
We present LAB, a benchmark of 6 diverse long document tasks with attribution, and experiments with different approaches to attribution on 5 LLMs of different sizes.
arXiv Detail & Related papers (2024-07-10T16:16:02Z) - CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation [76.31621715032558]
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses.
We introduce CaLM, a novel verification framework.
Our framework empowers smaller LMs, which rely less on parametric memory, to validate the output of larger LMs.
arXiv Detail & Related papers (2024-06-08T06:04:55Z) - Give Me More Details: Improving Fact-Checking with Latent Retrieval [58.706972228039604]
Evidence plays a crucial role in automated fact-checking.
Existing fact-checking systems either assume the evidence sentences are given or use the search snippets returned by the search engine.
We propose to incorporate full text from source documents as evidence and introduce two enriched datasets.
arXiv Detail & Related papers (2023-05-25T15:01:19Z) - SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models [55.60306377044225]
"SelfCheckGPT" is a simple sampling-based approach to fact-check the responses of black-box models.
We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset.
arXiv Detail & Related papers (2023-03-15T19:31:21Z) - WiCE: Real-World Entailment for Claims in Wikipedia [63.234352061821625]
We propose WiCE, a new fine-grained textual entailment dataset built on natural claim and evidence pairs extracted from Wikipedia.
In addition to standard claim-level entailment, WiCE provides entailment judgments over sub-sentence units of the claim.
We show that real claims in our dataset involve challenging verification and retrieval problems that existing models fail to address.
arXiv Detail & Related papers (2023-03-02T17:45:32Z) - Making Document-Level Information Extraction Right for the Right Reasons [19.00249049142611]
Document-level information extraction is a flexible framework compatible with applications where information is not necessarily localized in a single sentence.
This work studies how to ensure that document-level neural models make correct inferences from complex text and make those inferences in an auditable way.
arXiv Detail & Related papers (2021-10-14T19:52:47Z) - AmbiFC: Fact-Checking Ambiguous Claims with Evidence [57.7091560922174]
We present AmbiFC, a fact-checking dataset with 10k claims derived from real-world information needs.
We analyze disagreements arising from ambiguity when comparing claims against evidence in AmbiFC.
We develop models that predict veracity while handling this ambiguity via soft labels.
arXiv Detail & Related papers (2021-04-01T17:40:08Z)