Improving abstractive summarization with energy-based re-ranking
- URL: http://arxiv.org/abs/2210.15553v1
- Date: Thu, 27 Oct 2022 15:43:36 GMT
- Title: Improving abstractive summarization with energy-based re-ranking
- Authors: Diogo Pernes, Afonso Mendes, André F.T. Martins
- Abstract summary: We propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics.
We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries.
- Score: 4.311978285976062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, automatic evaluation metrics such as CTC scores have been recently proposed that exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging the recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
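To make the re-ranking idea concrete, here is a minimal sketch in Python of scoring candidate summaries with a learned energy function and keeping the lowest-energy one. The names (EnergyScorer, rerank) and the choice of a RoBERTa encoder with a linear head are illustrative assumptions, not the paper's exact architecture or training recipe.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class EnergyScorer(nn.Module):
    """Illustrative energy model: encodes a (document, candidate summary) pair
    and maps it to a scalar energy, where lower energy means a preferred summary."""

    def __init__(self, encoder_name: str = "roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, document: str, summary: str) -> torch.Tensor:
        inputs = self.tokenizer(document, summary, truncation=True,
                                max_length=512, return_tensors="pt")
        # Use the first-token representation of the encoded pair as a summary-quality feature.
        cls_state = self.encoder(**inputs).last_hidden_state[:, 0]
        return self.head(cls_state).squeeze(-1)  # scalar energy


def rerank(document: str, candidates: list[str], scorer: EnergyScorer) -> str:
    """Return the candidate summary assigned the lowest energy by the scorer."""
    with torch.no_grad():
        energies = [scorer(document, summary).item() for summary in candidates]
    best_index = min(range(len(candidates)), key=lambda i: energies[i])
    return candidates[best_index]
```

In practice the candidates would come from beam search or sampling with the base summarizer, and the scorer would be trained so that its ranking of candidates agrees with the chosen quality metric (for example, a CTC score or a combination of metrics), as described in the abstract.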
Related papers
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
- Redundancy Aware Multi-Reference Based Gainwise Evaluation of Extractive Summarization [3.5297361401370044]
The ROUGE metric has been criticized for its lack of semantic awareness and its insensitivity to the ranking quality of the extractive summarizer.
Previous research has introduced a gain-based automated metric called Sem-nCG that addresses these issues.
We propose a redundancy-aware Sem-nCG metric and demonstrate how it can be used to evaluate model summaries against multiple references.
arXiv Detail & Related papers (2023-08-04T11:47:19Z)
- Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore, a simple combination of CLIPScore and BERTScore, to leverage their robustness and strong factuality detection performance on image-summary and document-summary pairs, respectively.
We show that this simple combination of the two metrics, applied zero-shot, achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
- Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics [17.677637487977208]
Modern summarization models generate highly fluent but often factually unreliable outputs.
Due to the lack of common benchmarks, metrics attempting to measure the factuality of automatically generated summaries cannot be compared.
We devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems.
arXiv Detail & Related papers (2021-04-27T17:28:07Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate the summary qualities without reference summaries by unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study of a weakly supervised summary evaluation approach that operates without reference summaries.
Massive amounts of data from existing summarization datasets are transformed into training examples by pairing documents with corrupted reference summaries (a brief sketch of this corruption step follows the list).
arXiv Detail & Related papers (2020-05-13T15:40:13Z)
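As referenced in the SueNes entry above, that approach builds its training signal by corrupting reference summaries. Below is a minimal sketch of how such weakly supervised (document, summary, label) examples could be produced, under assumed corruption strategies (random sentence deletion or swapping); it illustrates the general idea rather than the paper's exact procedure.

```python
import random


def corrupt_summary(summary: str, rng: random.Random) -> str:
    """Produce a degraded summary to serve as a negative training example.
    Illustrative corruptions (assumed, not the paper's recipe): drop one
    sentence or swap two sentences."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    if len(sentences) < 2:
        # Too short to delete or swap sentences: fall back to crude truncation.
        return summary[: max(1, len(summary) // 2)]
    if rng.random() < 0.5:
        del sentences[rng.randrange(len(sentences))]  # sentence deletion
    else:
        i, j = rng.sample(range(len(sentences)), 2)
        sentences[i], sentences[j] = sentences[j], sentences[i]  # sentence swap
    return ". ".join(sentences) + "."


def make_training_pairs(examples, seed: int = 0):
    """Turn (document, reference) pairs into (document, summary, label) triples:
    label 1 for the intact reference, 0 for its corrupted counterpart."""
    rng = random.Random(seed)
    pairs = []
    for document, reference in examples:
        pairs.append((document, reference, 1))
        pairs.append((document, corrupt_summary(reference, rng), 0))
    return pairs
```

A scorer trained on such pairs can then rate new summaries at test time without needing any reference summary.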