The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection
- URL: http://arxiv.org/abs/2510.21118v2
- Date: Mon, 27 Oct 2025 02:50:12 GMT
- Title: The Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection
- Authors: Qiang Ding, Lvzhou Luo, Yixuan Cao, Ping Luo
- Abstract summary: Existing benchmarks suffer from annotation ambiguity due to the ill-defined boundary of permissible external knowledge. We propose a novel faithfulness annotation framework, which introduces an intermediate category, Out-Dependent. Using this framework, we construct VeriGray -- a new unfaithfulness detection benchmark in summarization.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Ensuring that Large Language Models (LLMs) generate summaries faithful to a given source document is essential for real-world applications. While prior research has explored LLM faithfulness, existing benchmarks suffer from annotation ambiguity, primarily due to the ill-defined boundary of permissible external knowledge in generated outputs. For instance, common sense is often incorporated into responses and labeled as "faithful", yet the acceptable extent of such knowledge remains unspecified, leading to inconsistent annotations. To address this issue, we propose a novel faithfulness annotation framework, which introduces an intermediate category, Out-Dependent, to classify cases where external knowledge is required for verification. Using this framework, we construct VeriGray (Verification with the Gray Zone) -- a new unfaithfulness detection benchmark in summarization. Statistics reveal that even SOTA LLMs, such as GPT-5, exhibit hallucinations ($\sim 6\%$ of sentences) in summarization tasks. Moreover, a substantial proportion ($\sim 8\%$ on average of models) of generated sentences fall into the Out-Dependent category, underscoring the importance of resolving annotation ambiguity in unfaithfulness detection benchmarks. Experiments demonstrate that our benchmark poses significant challenges to multiple baseline methods, indicating considerable room for future improvement.
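The paper's key move is replacing a binary faithful/unfaithful label with a three-way scheme that routes verification-requires-external-knowledge cases to an intermediate category. As a minimal illustrative sketch (hypothetical names and decision rule, not the authors' code or the VeriGray annotation protocol), the label space and routing logic might look like:

```python
from enum import Enum

class FaithfulnessLabel(Enum):
    FAITHFUL = "faithful"            # fully verifiable from the source document alone
    UNFAITHFUL = "unfaithful"        # contradicted by or unsupported given the source
    OUT_DEPENDENT = "out-dependent"  # external knowledge required for verification

def annotate(supported_by_source: bool, needs_external_knowledge: bool) -> FaithfulnessLabel:
    """Toy decision rule: sentences whose verification depends on outside
    knowledge go to the intermediate category rather than being forced
    into a faithful/unfaithful call, which is the source of the annotation
    ambiguity the paper describes."""
    if needs_external_knowledge:
        return FaithfulnessLabel.OUT_DEPENDENT
    return FaithfulnessLabel.FAITHFUL if supported_by_source else FaithfulnessLabel.UNFAITHFUL
```

The point of the intermediate label is that annotators no longer have to decide how much common sense is "permissible"; such sentences are set aside explicitly instead of being labeled inconsistently.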
Related papers
- Cross-Examination Framework: A Task-Agnostic Diagnostic for Information Fidelity in Text-to-Text Generation [1.405010905897415]
Traditional metrics like BLEU and BERTScore fail to capture semantic fidelity in generative text-to-text tasks. We adapt the Cross-Examination Framework (CEF) for a reference-free, multi-dimensional evaluation. CEF generates verifiable questions from each text and performs a cross-examination to derive three interpretable scores: Coverage, Conformity, and Consistency.
arXiv Detail & Related papers (2026-01-27T08:30:13Z) - FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning [62.452350134196934]
FaithCoT-Bench is a unified benchmark for instance-level CoT unfaithfulness detection. Our framework formulates unfaithfulness detection as a discriminative decision problem. FaithCoT-Bench sets a solid basis for future research toward more interpretable and trustworthy reasoning in LLMs.
arXiv Detail & Related papers (2025-10-05T05:16:54Z) - Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation [108.13261761812517]
We introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs. We present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness.
arXiv Detail & Related papers (2025-05-27T11:56:59Z) - ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation [91.20492150248106]
We investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. We propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs. Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory.
arXiv Detail & Related papers (2025-02-21T15:50:41Z) - Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation [29.44609627447293]
We propose an approach to summary faithfulness evaluation in which multiple agents are assigned initial stances. We introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases. Experiments demonstrate that our approach can help identify ambiguities and achieves even stronger performance on non-ambiguous summaries.
arXiv Detail & Related papers (2025-02-12T15:46:50Z) - FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework. Our study reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness.
arXiv Detail & Related papers (2024-09-30T06:27:53Z) - FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE)
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z) - Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z)