FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction
- URL: http://arxiv.org/abs/2403.02270v3
- Date: Sat, 31 Aug 2024 07:30:59 GMT
- Title: FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction
- Authors: Alessandro Scirè, Karim Ghonim, Roberto Navigli
- Abstract summary: We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
- Score: 85.26780391682894
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.
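The abstract describes FENICE's core recipe at a high level: split the summary into atomic claims, check each claim against the source document with an NLI model, and aggregate the alignment scores. Below is a minimal sketch of that general idea only, not the authors' released implementation (see https://github.com/Babelscape/FENICE); the NLI checkpoint, the hand-written claims, and the plain averaging are illustrative assumptions.

```python
# Minimal sketch of NLI-based claim-to-document alignment, in the spirit of FENICE.
# Assumptions (not from the paper): the DeBERTa MNLI checkpoint, manual claims,
# and simple score averaging are illustrative choices only.
from transformers import pipeline

# Any off-the-shelf NLI model exposed as text classification works here.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def claim_alignment_score(document: str, claims: list[str]) -> float:
    """Score each summary claim against the source document with NLI and average."""
    scores = []
    for claim in claims:
        # Premise = source document, hypothesis = atomic claim from the summary.
        result = nli({"text": document, "text_pair": claim}, top_k=None)
        if result and isinstance(result[0], list):  # some versions nest the output
            result = result[0]
        entail = next(r["score"] for r in result if r["label"].upper().startswith("ENTAIL"))
        scores.append(entail)
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage: in FENICE the claims are extracted automatically from the summary,
# not written by hand as they are here.
doc = "Marie Curie won the Nobel Prize in Physics in 1903 and in Chemistry in 1911."
claims = ["Marie Curie won two Nobel Prizes.",
          "Marie Curie won the Nobel Prize in Literature."]
print(claim_alignment_score(doc, claims))
```

The second claim is unsupported by the document, so its low entailment score drags the aggregate down, which is the behaviour a factuality metric of this kind is meant to capture.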
Related papers
- Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks [14.319567507959759]
We study existing approaches to leverage off-the-shelf Natural Language Inference (NLI) models for the evaluation of summary faithfulness.
We propose a novel approach, namely InFusE, that uses a variable premise size and simplifies summary sentences into shorter hypotheses.
arXiv Detail & Related papers (2024-02-27T15:57:11Z)
- TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization [29.49641083851667]
We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes.
We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences.
arXiv Detail & Related papers (2024-02-20T18:58:49Z)
- AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
- Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate the faithfulness of machine-generated text by computing the longest noncontinuous substring of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore, a combination of CLIPScore and BERTScore, to leverage their robustness and strong factuality detection performance on image-summary and document-summary pairs, respectively.
We show that this simple combination of the two metrics achieves higher correlations in the zero-shot setting than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
- How Far are We from Robust Long Abstractive Summarization? [39.34743996451813]
We evaluate long document abstractive summarization systems (i.e., models and metrics) with the aim of implementing them to generate reliable summaries.
For long document evaluation metrics, human evaluation results show that ROUGE remains the best at evaluating the relevancy of a summary.
We release our annotated long document dataset with the hope that it can contribute to the development of metrics across a broader range of summarization settings.
arXiv Detail & Related papers (2022-10-30T03:19:50Z)
- Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation [42.63902468258758]
We propose a novel metric to evaluate the factual consistency in text summarization via counterfactual estimation.
We conduct a series of experiments on three public abstractive text summarization datasets.
arXiv Detail & Related papers (2021-08-30T11:48:41Z)
- Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics [17.677637487977208]
Modern summarization models generate highly fluent but often factually unreliable outputs.
Due to the lack of common benchmarks, metrics attempting to measure the factuality of automatically generated summaries cannot be compared.
We devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems.
arXiv Detail & Related papers (2021-04-27T17:28:07Z)
- Enhancing Factual Consistency of Abstractive Summarization [57.67609672082137]
We propose a fact-aware summarization model FASum to extract and integrate factual relations into the summary generation process.
We then design a factual corrector model FC to automatically correct factual errors from summaries generated by existing systems.
arXiv Detail & Related papers (2020-03-19T07:36:10Z)