Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking
- URL: http://arxiv.org/abs/2411.05375v1
- Date: Fri, 08 Nov 2024 07:05:06 GMT
- Title: Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking
- Authors: Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos
- Abstract summary: Ev2R is an evaluation framework for automated fact-checking (AFC).
It comprises three types of approaches for evidence evaluation: reference-based, proxy-reference, and reference-less.
We evaluate their effectiveness through agreement with human ratings and adversarial tests.
- Score: 11.300523252168327
- License:
- Abstract: Current automated fact-checking (AFC) approaches commonly evaluate evidence either implicitly via the predicted verdicts or by comparing retrieved evidence with a predefined closed knowledge source, such as Wikipedia. However, these methods suffer from limitations, resulting from their reliance on evaluation metrics developed for different purposes and constraints imposed by closed knowledge sources. Recent advances in natural language generation (NLG) evaluation offer new possibilities for evidence assessment. In this work, we introduce Ev2R, an evaluation framework for AFC that comprises three types of approaches for evidence evaluation: reference-based, proxy-reference, and reference-less. We evaluate their effectiveness through agreement with human ratings and adversarial tests, and demonstrate that prompt-based scorers, particularly those leveraging LLMs and reference evidence, outperform traditional evaluation approaches.
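As a rough illustration of the reference-based, prompt-based scorer idea (not the authors' prompts or implementation; the client, model name, prompt wording, and 0-1 scale below are assumptions), an LLM can be asked how well the retrieved evidence covers the reference evidence for a claim:

```python
# Minimal sketch of a reference-based, LLM-prompted evidence scorer in the
# spirit of Ev2R. The prompt wording, model name, and scoring scale are
# illustrative assumptions, not the paper's implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_evidence(claim: str, retrieved: list[str], reference: list[str]) -> float:
    """Ask an LLM how much of the reference evidence the retrieved evidence covers."""
    prompt = (
        f"Claim: {claim}\n\n"
        "Reference evidence:\n" + "\n".join(f"- {e}" for e in reference) + "\n\n"
        "Retrieved evidence:\n" + "\n".join(f"- {e}" for e in retrieved) + "\n\n"
        "On a scale from 0 to 1, how much of the factual content of the "
        "reference evidence is covered by the retrieved evidence? "
        "Answer with a single number."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # A production scorer would parse the reply more defensively.
    return float(response.choices[0].message.content.strip())
```

Under the same assumptions, a proxy-reference variant would replace the gold references with evidence retrieved from a trusted source, and a reference-less variant would drop the reference block and ask whether the retrieved evidence alone suffices to verify the claim.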
Related papers
- Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning [5.065947993017158]
In this work, we apply an iterative FV system to three medical fact-checking datasets and evaluate it under multiple settings.
We demonstrate improvements in the final performance over traditional approaches and the high potential of step-by-step FV systems for domain-specific claims.
arXiv Detail & Related papers (2025-02-20T17:40:21Z) - Deep Assessment of Code Review Generation Approaches: Beyond Lexical Similarity [27.92468098611616]
We propose two novel semantic-based approaches for assessing code reviews.
The first approach involves converting both the generated review and its reference into digital vectors using a deep learning model.
The second approach generates a prompt based on the generated review and its reference, submits this prompt to ChatGPT, and requests ChatGPT to rate the generated review.
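A minimal sketch of the first, embedding-based approach (the embedding model below is an arbitrary choice, not the one used in the paper); the second approach follows the same LLM-as-judge pattern sketched above:

```python
# Rough sketch: embed the generated review and its reference, then compare
# them with cosine similarity. Model choice is illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_review_score(generated: str, reference: str) -> float:
    """Return the cosine similarity between the two reviews' embeddings."""
    embeddings = model.encode([generated, reference], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))
```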
arXiv Detail & Related papers (2025-01-09T11:52:32Z) - Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods [49.62131719441252]
Attribution methods compute importance scores for input features to explain the output predictions of deep models.
In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill.
We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria.
arXiv Detail & Related papers (2024-05-02T13:48:37Z) - DEE: Dual-stage Explainable Evaluation Method for Text Generation [21.37963672432829]
We introduce DEE, a Dual-stage Explainable Evaluation method for estimating the quality of text generation.
Built upon Llama 2, DEE follows a dual-stage principle guided by stage-specific instructions to perform efficient identification of errors in generated texts.
The dataset concerns newly emerged issues like hallucination and toxicity, thereby broadening the scope of DEE's evaluation criteria.
arXiv Detail & Related papers (2024-03-18T06:30:41Z) - KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z) - KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z) - Plugin estimators for selective classification with out-of-distribution detection [67.28226919253214]
Real-world classifiers can benefit from abstaining from predicting on samples where they have low confidence.
These settings have been the subject of extensive but disjoint study in the selective classification (SC) and out-of-distribution (OOD) detection literature.
Recent work on selective classification with OOD detection has argued for the unified study of these problems.
We propose new plugin estimators for SCOD that are theoretically grounded, effective, and generalise existing approaches.
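As a toy illustration of the kind of decision rule such a SCOD estimator combines (the thresholds and scores below are placeholders, not the paper's estimator):

```python
import numpy as np

def scod_predict(probs: np.ndarray, ood_score: float,
                 conf_threshold: float = 0.7, ood_threshold: float = 0.5):
    """Toy SCOD rule: abstain on low-confidence or likely-OOD inputs.

    probs: softmax probabilities for one sample; ood_score: higher means more OOD.
    Thresholds are illustrative, not values from the paper.
    """
    if probs.max() < conf_threshold or ood_score > ood_threshold:
        return None  # abstain
    return int(probs.argmax())
```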
arXiv Detail & Related papers (2023-01-29T07:45:17Z) - OpenOOD: Benchmarking Generalized Out-of-Distribution Detection [60.13300701826931]
Out-of-distribution (OOD) detection is vital to safety-critical machine learning applications.
The field currently lacks a unified, strictly formulated, and comprehensive benchmark.
We build a unified, well-structured benchmark called OpenOOD, which implements over 30 methods developed in relevant fields.
arXiv Detail & Related papers (2022-10-13T17:59:57Z) - From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI [3.7592122147132776]
We identify 12 conceptual properties, such as Compactness and Correctness, that should be evaluated for comprehensively assessing the quality of an explanation.
We find that 1 in 3 papers evaluate exclusively with anecdotal evidence, and 1 in 5 papers evaluate with users.
This systematic collection of evaluation methods provides researchers and practitioners with concrete tools to thoroughly validate, benchmark and compare new and existing XAI methods.
arXiv Detail & Related papers (2022-01-20T13:23:20Z) - Posthoc Verification and the Fallibility of the Ground Truth [10.427125361534966]
We conduct a systematic posthoc verification experiment on the entity linking (EL) task.
Compared to pre-annotation evaluation, state-of-the-art EL models performed extremely well according to the posthoc evaluation methodology.
Surprisingly, we find that predictions from EL models had a verification rate similar to or higher than that of the ground truth.
arXiv Detail & Related papers (2021-06-02T17:57:09Z) - REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
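A minimal sketch of the general idea, with `predict_reliability` standing in for the paper's prediction model and `metric` for any reference-based metric (both hypothetical here):

```python
# Weight each reference by a predicted reliability score when aggregating a
# reference-based metric; the callables are hypothetical stand-ins.
from typing import Callable

def weighted_reference_score(response: str, references: list[str],
                             metric: Callable[[str, str], float],
                             predict_reliability: Callable[[str], float]) -> float:
    weights = [predict_reliability(ref) for ref in references]
    scores = [metric(response, ref) for ref in references]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, scores)) / total
```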
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.