Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking
- URL: http://arxiv.org/abs/2411.05375v2
- Date: Fri, 18 Jul 2025 14:38:50 GMT
- Title: Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking
- Authors: Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos
- Abstract summary: Ev2R combines the strengths of reference-based evaluation and verdict-level proxy scoring. Ev2R consistently outperforms existing scoring approaches in accuracy and robustness.
- Score: 11.300523252168327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current automated fact-checking (AFC) approaches typically evaluate evidence either implicitly via the predicted verdicts or through exact matches with predefined closed knowledge sources, such as Wikipedia. However, these methods are limited due to their reliance on evaluation metrics originally designed for other purposes and constraints from closed knowledge sources. In this work, we introduce Ev2R, which combines the strengths of reference-based evaluation and verdict-level proxy scoring. Ev2R jointly assesses how well the evidence aligns with the gold references and how reliably it supports the verdict, addressing the shortcomings of prior methods. We evaluate Ev2R against three types of evidence evaluation approaches: reference-based, proxy-reference, and reference-less baselines. Assessments against human ratings and adversarial tests demonstrate that Ev2R consistently outperforms existing scoring approaches in accuracy and robustness. It achieves stronger correlation with human judgments and greater robustness to adversarial perturbations, establishing it as a reliable metric for evidence evaluation in AFC. Code is available at https://github.com/mubasharaak/fc-evidence-evaluation.
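For a concrete picture, below is a minimal sketch of how a two-part evidence score in the spirit of Ev2R might be computed, assuming an LLM judge exposed through a hypothetical `ask_llm` callable; the actual prompts, evidence decomposition, and weighting are defined in the paper and its repository, not here.

```python
# Hypothetical sketch of a combined evidence score in the spirit of Ev2R:
# one LLM-judged component for agreement with the gold reference evidence and
# one for whether the retrieved evidence supports the predicted verdict.
# `ask_llm` is a placeholder for any chat-completion client returning a number
# in [0, 1]; the real prompts and weighting live in the Ev2R repository.

from typing import Callable, List

def ev2r_style_score(
    claim: str,
    retrieved_evidence: List[str],
    gold_evidence: List[str],
    predicted_verdict: str,
    ask_llm: Callable[[str], float],
    alpha: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Combine reference alignment and verdict support into one score."""
    reference_prompt = (
        "Rate from 0 to 1 how much of the gold evidence is covered by the "
        f"retrieved evidence.\nClaim: {claim}\n"
        f"Gold evidence: {gold_evidence}\nRetrieved evidence: {retrieved_evidence}"
    )
    verdict_prompt = (
        "Rate from 0 to 1 how well the retrieved evidence supports the verdict "
        f"'{predicted_verdict}' for the claim.\nClaim: {claim}\n"
        f"Retrieved evidence: {retrieved_evidence}"
    )
    reference_alignment = ask_llm(reference_prompt)
    verdict_support = ask_llm(verdict_prompt)
    return alpha * reference_alignment + (1 - alpha) * verdict_support
```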
Related papers
- Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision [25.382800247901827]
DeepfakeJudge is a framework for scalable reasoning supervision and evaluation.
It integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models.
arXiv Detail & Related papers (2026-02-23T11:08:46Z) - Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation [21.864019348357303]
Large language models (LLMs) are increasingly used as automatic judges for question answering (QA).
We show that when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity.
We empirically show that this vulnerability is driven by judges' over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict.
arXiv Detail & Related papers (2026-01-12T13:05:13Z) - Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers [90.50039419576807]
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling.
To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary {0, 1} during training.
This choice carries a cost: it introduces false negatives (rejecting correct answers, FNs) and false positives (accepting incorrect ones, FPs).
arXiv Detail & Related papers (2025-10-01T13:56:44Z) - Reconstructing Trust Embeddings from Siamese Trust Scores: A Direct-Sum Approach with Fixed-Point Semantics [0.0]
We study the inverse problem of reconstructing high-dimensional trust embeddings from the one-dimensional Siamese trust scores that many distributed-security frameworks expose.
A suite of synthetic benchmarks confirms that, even in the presence of Gaussian noise, the recovered embeddings preserve inter-device geometry as measured by Euclidean and cosine metrics.
The paper demonstrates a practical privacy risk: publishing granular trust scores can leak latent behavioural information about both devices and evaluation models.
arXiv Detail & Related papers (2025-08-02T20:19:22Z) - Where is this coming from? Making groundedness count in the evaluation of Document VQA models [12.951716701565019]
We argue that common evaluation metrics do not account for the semantic and multimodal groundedness of a model's outputs.
We propose a new evaluation methodology that accounts for the groundedness of predictions.
Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences.
arXiv Detail & Related papers (2025-03-24T20:14:46Z) - ClaimTrust: Propagation Trust Scoring for RAG Systems [7.7690689135107425]
ClaimTrust is a propagation-based trust scoring framework that dynamically evaluates the reliability of documents in a RAG system.
We preprocess and analyze 814 political news articles to extract 2,173 unique claims and classify 965 meaningful relationships.
ClaimTrust iteratively updates trust scores until convergence, effectively differentiating trustworthy articles from unreliable ones.
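As a rough illustration of the iterate-until-convergence idea behind propagation-based trust scoring (not ClaimTrust's published update rule), the sketch below spreads trust along supporting and contradicting relations between documents; the damping factor, edge-weight convention, and tolerance are placeholder assumptions.

```python
# Illustrative trust propagation over a document/claim relationship graph.
# This is NOT ClaimTrust's exact update rule; damping and thresholds are
# assumptions used only to show the iterate-until-convergence pattern.

def propagate_trust(edges, initial_trust, damping=0.85, tol=1e-6, max_iter=100):
    """edges: node -> list of (neighbor, weight), weight in [-1, 1] where
    positive means supporting and negative means contradicting.
    initial_trust: node -> prior trust in [0, 1]."""
    trust = dict(initial_trust)
    for _ in range(max_iter):
        new_trust = {}
        for node, prior in initial_trust.items():
            neighbors = edges.get(node, [])
            if neighbors:
                influence = sum(w * trust[nbr] for nbr, w in neighbors) / len(neighbors)
            else:
                influence = prior  # isolated nodes keep their prior
            blended = (1 - damping) * prior + damping * influence
            new_trust[node] = min(1.0, max(0.0, blended))
        delta = max(abs(new_trust[n] - trust[n]) for n in trust)
        trust = new_trust
        if delta < tol:
            break
    return trust

# Example: two mutually supporting articles and one contradicted article.
scores = propagate_trust(
    edges={"a": [("b", 1.0)], "b": [("a", 1.0), ("c", -1.0)], "c": []},
    initial_trust={"a": 0.9, "b": 0.5, "c": 0.5},
)
```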
arXiv Detail & Related papers (2025-03-12T07:52:24Z) - DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering [12.879551933541345]
We propose the Dynamic Arbitration Framework for Evaluation (DAFE) to evaluate large language models.
DAFE employs two primary LLM-as-judges and engages a third arbitrator only in cases of disagreements.
We show DAFE's ability to provide consistent, scalable, and resource-efficient assessments.
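The two-judges-plus-arbitrator pattern described above is easy to sketch; `judge_a`, `judge_b`, and `arbitrator` below are placeholders for LLM-backed callables, and the prompts and tie-breaking details are assumptions rather than the paper's exact protocol.

```python
# Sketch of a dynamic-arbitration judgment in the spirit of DAFE: two primary
# LLM judges vote, and a third arbitrator is consulted only on disagreement.

from typing import Callable

def dafe_style_verdict(
    question: str,
    answer: str,
    judge_a: Callable[[str, str], bool],
    judge_b: Callable[[str, str], bool],
    arbitrator: Callable[[str, str], bool],
) -> bool:
    """Return the final correctness label for a free-form answer."""
    vote_a = judge_a(question, answer)
    vote_b = judge_b(question, answer)
    if vote_a == vote_b:
        return vote_a                    # agreement: no arbitration needed
    return arbitrator(question, answer)  # disagreement: third judge decides
```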
arXiv Detail & Related papers (2025-03-11T15:29:55Z) - SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection.
Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains.
We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
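A toy version of a semantic F1-score along these lines might look as follows, with `are_semantically_equivalent` standing in for the LLM-judged label matching; SEOE's actual scoring additionally incorporates fine-grained definitions of semantically similar labels.

```python
# Toy semantic F1: predicted and gold event types are matched by a semantic
# equivalence check (e.g., an LLM call) instead of exact string matching.

from typing import Callable, List

def semantic_f1(
    predicted: List[str],
    gold: List[str],
    are_semantically_equivalent: Callable[[str, str], bool],
) -> float:
    matched_gold = set()
    true_positives = 0
    for p in predicted:
        for i, g in enumerate(gold):
            if i not in matched_gold and are_semantically_equivalent(p, g):
                matched_gold.add(i)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```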
arXiv Detail & Related papers (2025-03-05T09:37:05Z) - Deep Assessment of Code Review Generation Approaches: Beyond Lexical Similarity [27.92468098611616]
We propose two novel semantic-based approaches for assessing code reviews.
The first approach involves converting both the generated review and its reference into digital vectors using a deep learning model.
The second approach generates a prompt based on the generated review and its reference, submits this prompt to ChatGPT, and requests ChatGPT to rate the generated review.
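A minimal sketch of the first, embedding-based approach is shown below; the paper does not specify this exact encoder, so the sentence-transformers library and the `all-MiniLM-L6-v2` model are stand-ins for whichever embedding model is actually used.

```python
# Embedding similarity between a generated code review and its reference.
# The encoder choice here is an assumption, not the paper's configuration.

from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def review_similarity(generated_review: str, reference_review: str) -> float:
    """Cosine similarity between the two review embeddings, in [-1, 1]."""
    embeddings = _model.encode([generated_review, reference_review])
    return float(util.cos_sim(embeddings[0], embeddings[1]))
```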
arXiv Detail & Related papers (2025-01-09T11:52:32Z) - Contrastive Learning to Improve Retrieval for Real-world Fact Checking [84.57583869042791]
We present Contrastive Fact-Checking Reranker (CFR), an improved retriever for fact-checking complex claims.
We leverage the AVeriTeC dataset, which annotates subquestions for claims with human written answers from evidence documents.
We find a 6% improvement in veracity classification accuracy on the dataset.
arXiv Detail & Related papers (2024-10-07T00:09:50Z) - Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks [17.520137576423593]
We aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR).
We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them.
We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR.
arXiv Detail & Related papers (2024-08-29T17:55:07Z) - Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods [49.62131719441252]
Attribution methods compute importance scores for input features to explain the output predictions of deep models.
In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill.
We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria.
arXiv Detail & Related papers (2024-05-02T13:48:37Z) - DEE: Dual-stage Explainable Evaluation Method for Text Generation [21.37963672432829]
We introduce DEE, a Dual-stage Explainable Evaluation method for estimating the quality of text generation.
Built upon Llama 2, DEE follows a dual-stage principle guided by stage-specific instructions to perform efficient identification of errors in generated texts.
The dataset concerns newly emerged issues like hallucination and toxicity, thereby broadening the scope of DEE's evaluation criteria.
arXiv Detail & Related papers (2024-03-18T06:30:41Z) - KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z) - One Prompt To Rule Them All: LLMs for Opinion Summary Evaluation [30.674896082482476]
We show that Op-I-Prompt emerges as a good alternative for evaluating opinion summaries, achieving an average Spearman correlation of 0.70 with humans.
To the best of our knowledge, we are the first to investigate LLMs as evaluators on both closed-source and open-source models in the opinion summarization domain.
arXiv Detail & Related papers (2024-02-18T19:13:52Z) - Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z) - SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach for prompt design in reference-free reasoning evaluation.
SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
arXiv Detail & Related papers (2023-09-29T18:25:46Z) - From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs.
The existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z) - KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z) - Plugin estimators for selective classification with out-of-distribution detection [67.28226919253214]
Real-world classifiers can benefit from abstaining from predicting on samples where they have low confidence.
These settings have been the subject of extensive but disjoint study in the selective classification (SC) and out-of-distribution (OOD) detection literature.
Recent work on selective classification with OOD detection has argued for the unified study of these problems.
We propose new plugin estimators for SCOD that are theoretically grounded, effective, and generalise existing approaches.
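As a simple illustration of the selective classification with OOD detection (SCOD) setting (not the paper's plug-in estimators), the sketch below abstains whenever classifier confidence is low or an OOD score is high; both thresholds and the combination rule are assumptions.

```python
# Illustrative SCOD rejection rule: abstain on low softmax confidence or a
# high OOD score. The paper derives principled plug-in estimators; this is
# only a heuristic sketch of the decision being made.

import numpy as np

def scod_predict(probs: np.ndarray, ood_score: float,
                 conf_threshold: float = 0.7, ood_threshold: float = 0.5):
    """probs: softmax class probabilities for one sample.
    Returns a class index, or None to abstain."""
    confidence = float(np.max(probs))
    if confidence < conf_threshold or ood_score > ood_threshold:
        return None  # abstain: low confidence or likely out-of-distribution
    return int(np.argmax(probs))
```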
arXiv Detail & Related papers (2023-01-29T07:45:17Z) - OpenOOD: Benchmarking Generalized Out-of-Distribution Detection [60.13300701826931]
Out-of-distribution (OOD) detection is vital to safety-critical machine learning applications.
The field currently lacks a unified, strictly formulated, and comprehensive benchmark.
We build a unified, well-structured codebase called OpenOOD, which implements over 30 methods developed in relevant fields.
arXiv Detail & Related papers (2022-10-13T17:59:57Z) - From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI [3.7592122147132776]
We identify 12 conceptual properties, such as Compactness and Correctness, that should be evaluated for comprehensively assessing the quality of an explanation.
We find that 1 in 3 papers evaluate exclusively with anecdotal evidence, and 1 in 5 papers evaluate with users.
This systematic collection of evaluation methods provides researchers and practitioners with concrete tools to thoroughly validate, benchmark and compare new and existing XAI methods.
arXiv Detail & Related papers (2022-01-20T13:23:20Z) - Realistic Evaluation Principles for Cross-document Coreference Resolution [19.95214898312209]
We argue that models should not exploit the synthetic topic structure of the standard ECB+ dataset.
We demonstrate empirically the drastic impact of our more realistic evaluation principles on a competitive model.
arXiv Detail & Related papers (2021-06-08T09:05:21Z) - Posthoc Verification and the Fallibility of the Ground Truth [10.427125361534966]
We conduct a systematic posthoc verification experiment on the entity linking (EL) task.
Compared to pre-annotation evaluation, state-of-the-art EL models performed extremely well according to the posthoc evaluation methodology.
Surprisingly, we find predictions from EL models had a similar or higher verification rate than the ground truth.
arXiv Detail & Related papers (2021-06-02T17:57:09Z) - REAM♯: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.