Related papers: Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation

Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation

URL: http://arxiv.org/abs/2411.16638v4
Date: Wed, 05 Nov 2025 17:42:36 GMT
Title: Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
Authors: Sanjana Ramprasad, Byron C. Wallace,
Abstract summary: We stress test a range of automatic factuality metrics to probe what they actually capture.<n>We find that all metrics show substantial performance drops on the latter.<n>Some metrics are more sensitive to benign, fact-preserving edits than to factual corrections.
Score: 16.506990103937515
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern LLMs can now produce highly readable abstractive summaries, to the point that traditional automated metrics for evaluating summary quality, such as ROUGE, have saturated. However, LLMs still sometimes introduce inaccuracies into summaries, i.e., information inconsistent with or unsupported by the corresponding source. Measuring the occurrence of these often subtle factual inconsistencies automatically has proved challenging. This in turn has motivated development of metrics intended to measure the factual consistency of generated summaries against sources. But are these approaches measuring what they purport to? Or are they mostly exploiting artifacts? In this work, we stress test a range of automatic factuality metrics, including specialized models and LLM-based prompting methods, to probe what they actually capture. Using a shallow classifier to separate ``easy'' examples for factual evaluation where surface features suffice from ``hard'' cases requiring deeper reasoning, we find that all metrics show substantial performance drops on the latter. Furthermore, some metrics are more sensitive to benign, fact-preserving edits than to factual corrections. Building on this observation, we demonstrate that most automatic factuality metrics can be gamed, i.e., their scores can be artificially inflated by appending innocuous, content-free sentences to summaries. Among the metrics tested, the prompt based ChatGPT-DA approach is the most robust and reliable. However, this comes with a notable caveat: Prompting LLMs to assess factuality may overly rely on their parametric knowledge rather than the provided reference when making judgments. Taken together, our findings call into question the reliability of current factuality metrics and prompt a broader reflection on what these metrics are truly measuring.

Related papers

Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
Position: All Current Generative Fidelity and Diversity Metrics are Flawed [58.815519650465774]
We show that all current generative fidelity and diversity metrics are flawed.<n>Our aim is to convince the research community to spend more effort in developing metrics, instead of models.
arXiv Detail & Related papers (2025-05-28T15:10:33Z)
FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift towards fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification.<n>We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality.<n>Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
arXiv Detail & Related papers (2024-11-08T21:26:57Z)
FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE) FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged. In this paper, we study if there are any deficiencies in reference-free metrics. We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
Alignment for Honesty [105.72465407518325]
Recent research has made significant strides in aligning large language models (LLMs) with helpfulness and harmlessness. In this paper, we argue for the importance of alignment for emphhonesty, ensuring that LLMs proactively refuse to answer questions when they lack knowledge. We address these challenges by first establishing a precise problem definition and defining honesty'' inspired by the Analects of Confucius.
arXiv Detail & Related papers (2023-12-12T06:10:42Z)
FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm. We collect responses generated from large language models and annotate factuality labels in a fine-grained manner. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)
Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics. We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs. Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z)
Evaluating the Factual Consistency of Large Language Models Through News Summarization [97.04685401448499]
We propose a new benchmark called FIB(Factual Inconsistency Benchmark) that focuses on the task of summarization. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. For factually inconsistent summaries, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent.
arXiv Detail & Related papers (2022-11-15T18:50:34Z)
TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
Probing Factually Grounded Content Transfer with Factual Ablation [68.78413677690321]
Grounded generation draws on a reliable external document (grounding) for factual information. Measuring factuality is also simplified--to factual consistency, testing whether the generation agrees with the grounding, rather than all facts. We study this problem for content transfer, in which generations extend a prompt, using information from factual grounding.
arXiv Detail & Related papers (2022-03-18T19:18:54Z)
On Quantitative Evaluations of Counterfactuals [88.42660013773647]
This paper consolidates work on evaluating visual counterfactual examples through an analysis and experiments. We find that while most metrics behave as intended for sufficiently simple datasets, some fail to tell the difference between good and bad counterfactuals when the complexity increases. We propose two new metrics, the Label Variation Score and the Oracle score, which are both less vulnerable to such tiny changes.
arXiv Detail & Related papers (2021-10-30T05:00:36Z)
Improving Factual Consistency of Abstractive Summarization via Question Answering [25.725873545789046]
We present an approach to address factual consistency in summarization. We first propose an efficient automatic evaluation metric to measure factual consistency. We then propose a novel learning algorithm that maximizes the proposed metric during model training.
arXiv Detail & Related papers (2021-05-10T19:07:21Z)
Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics [17.677637487977208]
Modern summarization models generate highly fluent but often factually unreliable outputs. Due to the lack of common benchmarks, metrics attempting to measure the factuality of automatically generated summaries cannot be compared. We devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems.
arXiv Detail & Related papers (2021-04-27T17:28:07Z)
GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.