A Training-free and Reference-free Summarization Evaluation Metric via
Centrality-weighted Relevance and Self-referenced Redundancy
- URL: http://arxiv.org/abs/2106.13945v1
- Date: Sat, 26 Jun 2021 05:11:27 GMT
- Title: A Training-free and Reference-free Summarization Evaluation Metric via
Centrality-weighted Relevance and Self-referenced Redundancy
- Authors: Wang Chen, Piji Li, Irwin King
- Abstract summary: We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
- Score: 60.419107377879925
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, reference-based and supervised summarization evaluation
metrics have been widely explored. However, collecting human-annotated
references and ratings is costly and time-consuming. To avoid these
limitations, we propose a training-free and reference-free summarization
evaluation metric. Our metric consists of a centrality-weighted relevance score
and a self-referenced redundancy score. The relevance score is computed between
the pseudo reference built from the source document and the given summary,
where the pseudo reference content is weighted by the sentence centrality to
provide importance guidance. Besides an $F_1$-based relevance score, we also
design an $F_\beta$-based variant that pays more attention to the recall score.
As for the redundancy score of the summary, we compute a self-masked similarity
score with the summary itself to evaluate the redundant information in the
summary. Finally, we combine the relevance and redundancy scores to produce the
final evaluation score of the given summary. Extensive experiments show that
our methods can significantly outperform existing methods on both
multi-document and single-document summarization evaluation.
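To make the described pipeline concrete, the following is a minimal sketch of a centrality-weighted relevance plus self-referenced redundancy scorer. It assumes token and sentence embeddings from some pretrained encoder, degree-based sentence centrality, BERTScore-style greedy matching, the standard $F_\beta = (1+\beta^2)PR/(\beta^2 P + R)$ combination of precision $P$ and recall $R$, and a simple relevance-minus-redundancy final score; the exact weighting, matching, and combination used in the paper may differ, and all function names below are illustrative.

```python
import numpy as np

def cosine_sim_matrix(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    B = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-12)
    return A @ B.T

def sentence_centrality(sent_embs):
    """Degree-style centrality: mean similarity of each sentence to the others, normalized."""
    sim = cosine_sim_matrix(sent_embs, sent_embs)
    np.fill_diagonal(sim, 0.0)
    c = np.clip(sim.mean(axis=1), 0.0, None)
    return c / (c.sum() + 1e-12)

def relevance_score(src_tok_embs, src_tok_weights, sum_tok_embs, beta=1.0):
    """Centrality-weighted, BERTScore-style relevance between pseudo reference and summary."""
    sim = cosine_sim_matrix(src_tok_embs, sum_tok_embs)
    # Recall: how well each (centrality-weighted) pseudo-reference token is covered by the summary.
    recall = float(np.sum(src_tok_weights * sim.max(axis=1)) / (src_tok_weights.sum() + 1e-12))
    # Precision: how well each summary token is grounded in the pseudo reference.
    precision = float(sim.max(axis=0).mean())
    # Standard F_beta; beta > 1 puts more weight on recall, as in the recall-oriented variant.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall + 1e-12)

def redundancy_score(sum_tok_embs):
    """Self-masked similarity: each summary token's max similarity to the *other* summary tokens."""
    sim = cosine_sim_matrix(sum_tok_embs, sum_tok_embs)
    np.fill_diagonal(sim, -np.inf)  # mask self-matches
    return float(sim.max(axis=1).mean())

def evaluate_summary(src_tok_embs, src_sent_embs, src_tok_sent_ids, sum_tok_embs, beta=2.0):
    """Combine centrality-weighted relevance with self-referenced redundancy (higher is better)."""
    weights = sentence_centrality(src_sent_embs)[src_tok_sent_ids]  # tokens inherit sentence weight
    rel = relevance_score(src_tok_embs, weights, sum_tok_embs, beta=beta)
    red = redundancy_score(sum_tok_embs)
    return rel - red

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
src_tok, src_sent = rng.normal(size=(40, 16)), rng.normal(size=(5, 16))
tok2sent, summ_tok = rng.integers(0, 5, size=40), rng.normal(size=(12, 16))
print(evaluate_summary(src_tok, src_sent, tok2sent, summ_tok))
```

The subtraction used to combine relevance and redundancy, and the choice of $\beta = 2$, are placeholder choices for this sketch rather than the paper's actual settings.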
Related papers
- Evaluating and Improving Factuality in Multimodal Abstractive
Summarization [91.46015013816083]
We propose CLIPBERTScore, which combines an image-summary metric and a document-summary metric to leverage their robustness and strong factuality detection performance.
We show that this simple combination of the two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
- Improving abstractive summarization with energy-based re-ranking [4.311978285976062]
We propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics.
We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries.
arXiv Detail & Related papers (2022-10-27T15:43:36Z)
- Towards Interpretable Summary Evaluation via Allocation of Contextual Embeddings to Reference Text Topics [1.5749416770494706]
The multifaceted interpretable summary evaluation method (MISEM) is based on the allocation of a summary's contextual token embeddings to semantic topics identified in the reference text.
MISEM achieves a promising .404 Pearson correlation with human judgment on the TAC'08 dataset.
arXiv Detail & Related papers (2022-10-25T17:09:08Z)
- Comparing Methods for Extractive Summarization of Call Centre Dialogue [77.34726150561087]
We experimentally compare several such methods by using them to produce summaries of calls, and evaluating these summaries objectively.
We found that TopicSum and Lead-N outperformed the other summarisation methods, whilst BERTSum received comparatively lower scores in both subjective and objective evaluations.
arXiv Detail & Related papers (2022-09-06T13:16:02Z)
- SNaC: Coherence Error Detection for Narrative Summarization [73.48220043216087]
We introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie screenplay summaries.
Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowd annotators.
arXiv Detail & Related papers (2022-05-19T16:01:47Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study of a weakly supervised summary evaluation approach that does not require reference summaries.
Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.
arXiv Detail & Related papers (2020-05-13T15:40:13Z)
- Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization [6.826626737986031]
We propose evaluation metrics that reflect the semantic meaning of a reference summary and the original document.
We then propose a method for improving the correlation of the metrics with human judgment.
arXiv Detail & Related papers (2020-04-29T08:26:30Z)