Context Shapes LLMs' Retrieval-Augmented Fact-Checking Effectiveness
- URL: http://arxiv.org/abs/2602.14044v2
- Date: Mon, 23 Feb 2026 22:32:36 GMT
- Title: Context Shapes LLMs' Retrieval-Augmented Fact-Checking Effectiveness
- Authors: Pietro Bernardelle, Stefano Civelli, Kevin Roitero, Gianluca Demartini
- Abstract summary: Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. We evaluate both factual knowledge and the impact of evidence placement across varying context lengths.
- Score: 6.250095470690937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. While prior research has emphasized mid-context degradation in question answering, this study examines the impact of context in LLM-based fact verification. Using three datasets (HOVER, FEVEROUS, and ClimateFEVER) and five open-source models across different parameter sizes (7B, 32B, and 70B) and model families (Llama-3.1, Qwen2.5, and Qwen3), we evaluate both parametric factual knowledge and the impact of evidence placement across varying context lengths. We find that LLMs exhibit non-trivial parametric knowledge of factual claims and that their verification accuracy generally declines as context length increases. Consistent with prior work, in-context evidence placement plays a critical role: accuracy is consistently higher when relevant evidence appears near the beginning or end of the prompt and lower when it is placed mid-context. These results underscore the importance of prompt structure in retrieval-augmented fact-checking systems.
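As a toy illustration of this kind of evidence-placement experiment, the sketch below builds a verification prompt with the gold evidence passage placed at the start, middle, or end of a run of distractor passages; running the resulting prompts against any LLM and comparing accuracy by position reproduces the shape of the setup. All function names and prompt wording here are ours, not the paper's.

```python
# Hypothetical sketch of an evidence-placement probe: the gold evidence is
# inserted at a chosen position among distractor passages, and the model is
# asked for a SUPPORTED/REFUTED verdict. Prompt wording is illustrative.

def build_prompt(claim: str, gold: str, distractors: list[str], position: str) -> str:
    """Assemble a fact-checking prompt with `gold` at a chosen position."""
    passages = list(distractors)
    if position == "start":
        passages.insert(0, gold)
    elif position == "middle":
        passages.insert(len(passages) // 2, gold)
    else:  # "end"
        passages.append(gold)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Decide whether the claim is SUPPORTED or REFUTED by the passages below.\n\n"
        f"{context}\n\nClaim: {claim}\nAnswer:"
    )

claim = "The Eiffel Tower is in Paris."
gold = "The Eiffel Tower stands on the Champ de Mars in Paris, France."
distractors = [f"Unrelated filler passage number {i}." for i in range(8)]
for pos in ("start", "middle", "end"):
    print(build_prompt(claim, gold, distractors, pos)[:80], "...")
```

Longer contexts can be simulated simply by growing the distractor list while keeping the gold passage's relative position fixed.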
Related papers
- AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains [3.721111684544962]
Hallucination in large language models (LLMs) contributes to the spread of misinformation and diminished public trust.
We introduce AdversaRiskQA, the first verified and reliable benchmark for systematically evaluating adversarial factuality.
We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates.
arXiv Detail & Related papers (2026-01-21T22:47:59Z)
- Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems.
We present FactArena, a fully automated arena-style evaluation framework.
Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z)
- Not All Needles Are Found: How Fact Distribution and Don't Make It Up Prompts Shape Literal Extraction, Logical Inference, and Hallucination Risks in Long-Context LLMs [0.0]
Large language models (LLMs) increasingly support very long input contexts.
It remains unclear how reliably they extract and infer information at scale.
We study how fact placement, corpus-level fact distributions, and "Don't Make It Up" prompts influence model behavior (a minimal prompt-variant sketch follows this entry).
arXiv Detail & Related papers (2026-01-05T11:30:56Z)
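For illustration only, here is a minimal pair of prompt variants of the kind such a study compares; the exact instruction wording the authors used is not given in the summary, so this phrasing is an assumption.

```python
# Two hypothetical prompt templates: a plain baseline and a variant with an
# explicit "don't make it up" instruction for unanswerable questions.

BASE = (
    "Answer the question using only the documents below.\n\n"
    "{docs}\n\nQ: {q}\nA:"
)
DONT_MAKE_IT_UP = (
    "Answer the question using only the documents below. If the answer is not "
    "in the documents, reply 'I don't know'; do not make anything up.\n\n"
    "{docs}\n\nQ: {q}\nA:"
)

docs = "Doc 1: The 2024 summit was held in Geneva."
question = "Where was the 2025 summit held?"  # deliberately unanswerable
for template in (BASE, DONT_MAKE_IT_UP):
    print(template.format(docs=docs, q=question), end="\n\n")
```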
- Positional Biases Shift as Inputs Approach Context Window Limits [57.00239097102958]
The lost-in-the-middle (LiM) effect is strongest when inputs occupy up to 50% of a model's context window.
We also observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input (a small measurement sketch follows this entry).
arXiv Detail & Related papers (2025-08-10T20:40:24Z)
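A minimal analysis sketch of the measurement this entry describes: bin accuracy by how much of the context window the input occupies. The record fields and the 128K-token window size are assumptions for illustration, not taken from the paper.

```python
# Hypothetical post-hoc analysis: group per-example results by the fraction
# of the model's context window that the prompt occupies.
from collections import defaultdict

CONTEXT_WINDOW = 128_000  # tokens; model-dependent assumption

def accuracy_by_occupancy(records: list[dict], n_bins: int = 4) -> dict[int, float]:
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        occupancy = r["prompt_tokens"] / CONTEXT_WINDOW  # fraction of window used
        b = min(int(occupancy * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(r["correct"])
    return {b: hits[b] / totals[b] for b in sorted(totals)}

records = [
    {"prompt_tokens": 20_000, "correct": True},
    {"prompt_tokens": 70_000, "correct": False},
    {"prompt_tokens": 110_000, "correct": True},
]
print(accuracy_by_occupancy(records))  # e.g. {0: 1.0, 2: 0.0, 3: 1.0}
```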
- Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.
We highlight the importance of addressing annotation errors and ambiguity in datasets.
Frontier LLMs with few-shot in-context examples, an approach often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
- A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces LongBioBench, a benchmark for evaluating long-context language models.
We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results.
Our further analysis examines design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z)
- When Context Leads but Parametric Memory Follows in Large Language Models [4.567122178196834]
Large language models (LLMs) have demonstrated remarkable progress in leveraging diverse knowledge sources.
This study investigates how nine widely used LLMs allocate knowledge between local context and global parameters when answering open-ended questions.
arXiv Detail & Related papers (2024-09-13T00:03:19Z)
- NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities [51.07379913779232]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks.
It embeds key data points at varying depths to rigorously test model capabilities.
Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
- Retrieval Augmented Fact Verification by Synthesizing Contrastive Arguments [23.639378586798884]
We propose retrieval augmented fact verification through the synthesis of contrasting arguments.
Our method effectively retrieves relevant documents as evidence and evaluates arguments from varying perspectives.
We demonstrate the effectiveness of our method through extensive experiments, where RAFTS can outperform GPT-based methods with a significantly smaller 7B LLM (a rough pipeline sketch follows this entry).
arXiv Detail & Related papers (2024-06-14T08:13:34Z)
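The following is a rough sketch of the general pattern the RAFTS entry describes, not the authors' implementation: retrieve evidence, prompt a model for one supporting and one refuting argument, then ask it to weigh the two. `retrieve` and `llm` are hypothetical stand-ins.

```python
# Contrastive-argument verification, sketched with stand-in callables.

def verify_with_contrastive_arguments(claim: str, retrieve, llm) -> str:
    docs = retrieve(claim)  # assumed to return a list of evidence strings
    evidence = "\n".join(docs)
    support = llm(f"Evidence:\n{evidence}\n\nArgue that the claim is TRUE: {claim}")
    refute = llm(f"Evidence:\n{evidence}\n\nArgue that the claim is FALSE: {claim}")
    return llm(
        "Given the evidence and both arguments, answer SUPPORTED or REFUTED.\n\n"
        f"Evidence:\n{evidence}\n\nFor: {support}\n\nAgainst: {refute}\n\nVerdict:"
    )

# Stub demo so the sketch runs without a real retriever or model:
print(verify_with_contrastive_arguments(
    "Coffee cures insomnia.",
    retrieve=lambda c: ["Caffeine is a stimulant that can disrupt sleep."],
    llm=lambda prompt: f"<model output for a {len(prompt)}-char prompt>",
))
```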
- Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question Answering [14.389264346634507]
We propose EFSum, an Evidence-focused Fact Summarization framework for enhanced Question Answering (QA) performance.
Our experiments show that EFSum improves LLMs' zero-shot QA performance (a brief sketch of the summarize-then-answer pattern follows this entry).
arXiv Detail & Related papers (2024-03-05T13:43:58Z)
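As a sketch of the general summarize-then-answer pattern this entry describes (prompt wording is ours, not EFSum's): condense the retrieved facts into a question-focused summary, then answer from the summary alone.

```python
# Summarize-then-answer, with `llm` as a hypothetical text-in/text-out callable.

def answer_with_fact_summary(question: str, facts: list[str], llm) -> str:
    summary = llm(
        "Summarize only the facts relevant to the question below.\n"
        f"Question: {question}\nFacts:\n" + "\n".join(facts)
    )
    return llm(
        f"Answer using only this summary.\nSummary: {summary}\nQuestion: {question}"
    )

print(answer_with_fact_summary(
    "Who directed Jaws?",
    ["Jaws (1975) was directed by Steven Spielberg.", "Jaws is set on Amity Island."],
    llm=lambda prompt: f"<model output for a {len(prompt)}-char prompt>",
))
```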
- FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from a tool-enhanced ChatGPT and a LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
- Context-faithful Prompting for Large Language Models [51.194410884263135]
Large language models (LLMs) encode parametric knowledge about world facts.
Their reliance on parametric knowledge may cause them to overlook contextual cues, leading to incorrect predictions in context-sensitive NLP tasks.
We assess and enhance LLMs' contextual faithfulness in two aspects: knowledge conflict and prediction with abstention (a hedged prompt sketch follows this entry).
arXiv Detail & Related papers (2023-03-20T17:54:58Z)
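A hedged sketch of the two prompt patterns the abstract names: an opinion-based reading prompt for knowledge-conflict cases and an explicit abstention option for unanswerable ones. The wording is paraphrased, not taken from the paper.

```python
# Two illustrative prompt builders; the narrator framing and abstention
# instruction follow the abstract's description, with our own phrasing.

def opinion_prompt(context: str, question: str) -> str:
    # Attribute the passage to a narrator so the model answers from the
    # given context rather than from its parametric memory.
    return f'Bob said: "{context}"\nQ: {question} (according to Bob)\nA:'

def abstention_prompt(context: str, question: str) -> str:
    return (
        f"{context}\n\nQ: {question}\n"
        'If the passage does not contain the answer, reply "unanswerable".\nA:'
    )

ctx = "The capital of Freedonia is Marxville."
print(opinion_prompt(ctx, "What is the capital of Freedonia?"))
print(abstention_prompt(ctx, "Who is the president of Freedonia?"))
```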