SummScore: A Comprehensive Evaluation Metric for Summary Quality Based
on Cross-Encoder
- URL: http://arxiv.org/abs/2207.04660v1
- Date: Mon, 11 Jul 2022 06:47:29 GMT
- Title: SummScore: A Comprehensive Evaluation Metric for Summary Quality Based
on Cross-Encoder
- Authors: Wuhang Lin, Shasha Li, Chen Zhang, Bin Ji, Jie Yu, Jun Ma, Zibo Yi
- Abstract summary: SummScore is a comprehensive metric for summary quality evaluation based on CrossEncoder.
To improve the comprehensiveness and interpretability, SummScore consists of four fine-grained submodels.
Extensive experiments show that SummScore significantly outperforms existing evaluation metrics in the above four dimensions in correlation with human scoring.
- Score: 12.913447457411317
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text summarization models are often trained to produce summaries that meet
human quality requirements. However, the existing evaluation metrics for
summary text are only rough proxies for summary quality, suffering from low
correlation with human scoring and inhibition of summary diversity. To solve
these problems, we propose SummScore, a comprehensive metric for summary
quality evaluation based on CrossEncoder. Firstly, by adopting the
original-summary measurement mode and comparing the semantics of the original
text, SummScore gets rid of the inhibition of summary diversity. With the help
of the text-matching pre-training Cross-Encoder, SummScore can effectively
capture the subtle differences between the semantics of summaries. Secondly, to
improve the comprehensiveness and interpretability, SummScore consists of four
fine-grained submodels, which measure Coherence, Consistency, Fluency, and
Relevance separately. We use semi-supervised multi-rounds of training to
improve the performance of our model on extremely limited annotated data.
Extensive experiments show that SummScore significantly outperforms existing
evaluation metrics in the above four dimensions in correlation with human
scoring. We also provide the quality evaluation results of SummScore on 16
mainstream summarization models for later research.
Related papers
- A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLMs evaluation aligns closely with human evaluation, while widely-used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z) - Is Summary Useful or Not? An Extrinsic Human Evaluation of Text
Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - Evaluating and Improving Factuality in Multimodal Abstractive
Summarization [91.46015013816083]
We propose CLIPBERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary.
We show that this simple combination of two metrics in the zero-shot achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - A Training-free and Reference-free Summarization Evaluation Metric via
Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z) - Understanding the Extent to which Summarization Evaluation Metrics
Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z) - SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z) - SueNes: A Weakly Supervised Approach to Evaluating Single-Document
Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study to a weakly supervised summary evaluation approach without the presence of reference summaries.
Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.
arXiv Detail & Related papers (2020-05-13T15:40:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.