Fine-Grained Natural Language Inference Based Faithfulness Evaluation
for Diverse Summarisation Tasks
- URL: http://arxiv.org/abs/2402.17630v1
- Date: Tue, 27 Feb 2024 15:57:11 GMT
- Title: Fine-Grained Natural Language Inference Based Faithfulness Evaluation
for Diverse Summarisation Tasks
- Authors: Huajian Zhang, Yumo Xu, Laura Perez-Beltrachini
- Abstract summary: We study existing approaches to leverage off-the-shelf Natural Language Inference (NLI) models for the evaluation of summary faithfulness.
We propose a novel approach, namely InFusE, that uses a variable premise size and simplifies summary sentences into shorter hypotheses.
- Score: 14.319567507959759
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study existing approaches to leverage off-the-shelf Natural Language
Inference (NLI) models for the evaluation of summary faithfulness and argue
that these are sub-optimal due to the granularity level considered for premises
and hypotheses. That is, the smaller content unit considered as hypothesis is a
sentence and premises are made up of a fixed number of document sentences. We
propose a novel approach, namely InFusE, that uses a variable premise size and
simplifies summary sentences into shorter hypotheses. Departing from previous
studies which focus on single short document summarisation, we analyse NLI
based faithfulness evaluation for diverse summarisation tasks. We introduce
DiverSumm, a new benchmark comprising long form summarisation (long documents
and summaries) and diverse summarisation tasks (e.g., meeting and
multi-document summarisation). In experiments, InFusE obtains superior
performance across the different summarisation tasks. Our code and data are
available at https://github.com/HJZnlp/infuse.
Related papers
- Write Summary Step-by-Step: A Pilot Study of Stepwise Summarization [48.57273563299046]
We propose the task of Stepwise Summarization, which aims to generate a new appended summary each time a new document is proposed.
The appended summary should not only summarize the newly added content but also be coherent with the previous summary.
We show that SSG achieves state-of-the-art performance in terms of both automatic metrics and human evaluations.
arXiv Detail & Related papers (2024-06-08T05:37:26Z) - FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE)
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z) - USB: A Unified Summarization Benchmark Across Tasks and Domains [68.82726887802856]
We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks.
We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models.
arXiv Detail & Related papers (2023-05-23T17:39:54Z) - UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot
Summarization [54.59104881168188]
textscUniSumm is a unified few-shot summarization model pre-trained with multiple summarization tasks.
textscSummZoo is a new benchmark to better evaluate few-shot summarizers.
arXiv Detail & Related papers (2022-11-17T18:54:47Z) - R$^2$F: A General Retrieval, Reading and Fusion Framework for
Document-level Natural Language Inference [29.520857954199904]
Document-level natural language inference (DOCNLI) is a new challenging task in natural language processing.
We establish a general solution, named Retrieval, Reading and Fusion (R2F) framework, and a new setting.
Our experimental results show that R2F framework can obtain state-of-the-art performance and is robust for diverse evidence retrieval methods.
arXiv Detail & Related papers (2022-10-22T02:02:35Z) - Stretching Sentence-pair NLI Models to Reason over Long Documents and
Clusters [35.103851212995046]
Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs.
We explore the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on.
We develop new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset.
arXiv Detail & Related papers (2022-04-15T12:56:39Z) - Unsupervised Summarization with Customized Granularities [76.26899748972423]
We propose the first unsupervised multi-granularity summarization framework, GranuSum.
By inputting different numbers of events, GranuSum is capable of producing multi-granular summaries in an unsupervised manner.
arXiv Detail & Related papers (2022-01-29T05:56:35Z) - SupMMD: A Sentence Importance Model for Extractive Summarization using
Maximum Mean Discrepancy [92.5683788430012]
SupMMD is a novel technique for generic and update summarization based on the maximum discrepancy from kernel two-sample testing.
We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.
arXiv Detail & Related papers (2020-10-06T09:26:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.