Related papers: Evaluating Evidence Attribution in Generated Fact Checking Explanations

URL: http://arxiv.org/abs/2406.12645v2
Date: Wed, 16 Oct 2024 18:23:39 GMT
Title: Evaluating Evidence Attribution in Generated Fact Checking Explanations
Authors: Rui Xing, Timothy Baldwin, Jey Han Lau
Abstract summary: We introduce a novel evaluation protocol, citation masking and recovery, to assess attribution quality in generated explanations. Experiments reveal that the best-performing LLMs still generate explanations with inaccurate attributions. Human-curated evidence is essential for generating better explanations.
Score: 48.776087871960584
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated fact-checking systems often struggle with trustworthiness, as their generated explanations can include hallucinations. In this work, we explore evidence attribution for fact-checking explanation generation. We introduce a novel evaluation protocol, citation masking and recovery, to assess attribution quality in generated explanations. We implement our protocol using both human annotators and automatic annotators, and find that LLM annotation correlates with human annotation, suggesting that attribution assessment can be automated. Finally, our experiments reveal that: (1) the best-performing LLMs still generate explanations with inaccurate attributions; and (2) human-curated evidence is essential for generating better explanations. Code and data are available here: https://github.com/ruixing76/Transparent-FCExp.
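As a rough illustration of the citation masking and recovery idea named in the abstract, the sketch below masks citation markers in an explanation and scores how many an annotator recovers. It is a minimal sketch under stated assumptions, not the authors' implementation: the bracketed numeric citation format (`[1]`), the `[MASK]` token, the exact-match accuracy score, and the function names `mask_citations` and `recovery_accuracy` are all hypothetical choices made for this example.

```python
import re

# Assumption: citations appear as bracketed integers such as [1] or [12].
CITATION = re.compile(r"\[(\d+)\]")


def mask_citations(explanation, mask="[MASK]"):
    """Hide every citation marker; return the masked text and the hidden ids in order."""
    gold = [int(m) for m in CITATION.findall(explanation)]
    masked = CITATION.sub(mask, explanation)
    return masked, gold


def recovery_accuracy(gold, recovered):
    """Fraction of masked citations that the annotator recovered exactly."""
    if len(gold) != len(recovered):
        raise ValueError("one recovered citation is required per masked citation")
    if not gold:
        return 0.0
    return sum(g == r for g, r in zip(gold, recovered)) / len(gold)


# Hypothetical explanation text, invented for illustration only.
explanation = (
    "The bill was never passed [1]. "
    "Officials confirmed this in a later statement [2]."
)
masked, gold = mask_citations(explanation)

# An annotator (human or LLM) reads the masked text plus the evidence
# and guesses which evidence item each mask refers to.
annotator_guess = [1, 3]
print(masked)
print(recovery_accuracy(gold, annotator_guess))
```

If the annotator's guesses frequently disagree with the citations the generator originally produced, that would suggest the attributions are hard to justify from the evidence, which is the kind of signal the protocol is described as measuring.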

Related papers

MADREC: A Multi-Aspect Driven LLM Agent for Explainable and Adaptive Recommendation [11.430206422495829]
Multi-Aspect Driven LLM Agent MADRec is an autonomous recommender that constructs user and item profiles by unsupervised extraction of multi-aspect information from reviews. MADRec generates structured profiles via aspect-category-based summarization and applies Re-Ranking to construct high-density inputs. Experiments across multiple domains show that MADRec outperforms traditional and LLM-based baselines in both precision and explainability.
arXiv Detail & Related papers (2025-10-15T10:03:29Z)
VeriCite: Towards Reliable Citations in Retrieval-Augmented Generation via Rigorous Verification [107.75781898355562]
We introduce a novel framework, called VeriCite, designed to rigorously validate supporting evidence and enhance answer attribution. We conduct experiments across five open-source LLMs and four datasets, demonstrating that VeriCite can significantly improve citation quality while maintaining the correctness of the answers.
arXiv Detail & Related papers (2025-10-13T13:38:54Z)
GE-Chat: A Graph Enhanced RAG Framework for Evidential Response Generation of LLMs [6.3596531375179515]
This paper proposes GE-Chat, a knowledge Graph enhanced retrieval-augmented generation framework to provide evidence-based response generation. Specifically, when the user uploads a material document, a knowledge graph will be created, which helps construct a retrieval-augmented agent. We leverage Chain-of-Thought (CoT) logic generation, n-hop sub-graph searching, and entailment-based sentence generation to realize accurate evidence retrieval.
arXiv Detail & Related papers (2025-05-15T10:17:35Z)
FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift toward fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification. We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality. Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
arXiv Detail & Related papers (2024-11-08T21:26:57Z)
AR-Pro: Counterfactual Explanations for Anomaly Repair with Formal Properties [12.71326587869053]
Anomaly detection is widely used for identifying critical errors and suspicious behaviors, but current methods lack interpretability. We leverage common properties of existing methods to introduce counterfactual explanations for anomaly detection. A key advantage of this approach is that it enables a domain-independent formal specification of explainability desiderata.
arXiv Detail & Related papers (2024-10-31T17:43:53Z)
Comparing zero-shot self-explanations with human rationales in multilingual text classification [5.32539007352208]
Instruction-tuned LLMs generate self-explanations that do not require computations or the application of possibly complex XAI methods. We analyse whether this ability results in a good explanation by evaluating self-explanations in the form of input rationales. Our results show that self-explanations align more closely with human annotations compared to LRP, while maintaining a comparable level of faithfulness.
arXiv Detail & Related papers (2024-10-04T10:14:12Z)
Evaluating the Reliability of Self-Explanations in Large Language Models [2.8894038270224867]
We evaluate two kinds of such self-explanations: extractive and counterfactual. Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process. We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results.
arXiv Detail & Related papers (2024-07-19T17:41:08Z)
Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate [75.10515686215177]
Large Language Models (LLMs) excel in text generation, but their capability for producing faithful explanations in fact-checking remains underexamined. We propose the Multi-Agent Debate Refinement (MADR) framework, leveraging multiple LLMs as agents with diverse roles. MADR ensures that the final explanation undergoes rigorous validation, significantly reducing the likelihood of unfaithful elements and aligning closely with the provided evidence.
arXiv Detail & Related papers (2024-02-12T04:32:33Z)
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models [55.60306377044225]
"SelfCheckGPT" is a simple sampling-based approach to fact-check the responses of black-box models. We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset.
arXiv Detail & Related papers (2023-03-15T19:31:21Z)
Re-Examining Human Annotations for Interpretable NLP [80.81532239566992]
We conduct controlled experiments using crowd-sourced websites on two widely used datasets in Interpretable NLP. We compare the annotation results obtained from recruiting workers satisfying different levels of qualification. Our results reveal that the annotation quality is highly subject to the workers' qualification, and workers can be guided to provide certain annotations by the instructions.
arXiv Detail & Related papers (2022-04-10T02:27:30Z)
Generating Fluent Fact Checking Explanations with Unsupervised Post-Editing [22.5444107755288]
We present an iterative edit-based algorithm that uses only phrase-level edits to perform unsupervised post-editing of ruling comments. We show that our model generates explanations that are fluent, readable, non-redundant, and cover important information for the fact check.
arXiv Detail & Related papers (2021-12-13T15:31:07Z)
Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [86.60613602337246]
We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations. LAS measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output. We frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage.
arXiv Detail & Related papers (2020-10-08T16:59:07Z)
Generating Fact Checking Explanations [52.879658637466605]
A crucial piece of the puzzle that is still missing is to understand how to automate the most elaborate part of the process. This paper provides the first study of how these explanations can be generated automatically based on available claim context. Our results indicate that optimising both objectives at the same time, rather than training them separately, improves the performance of a fact checking system.
arXiv Detail & Related papers (2020-04-13T05:23:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.