Related papers: SummScore: A Comprehensive Evaluation Metric for Summary Quality Based on Cross-Encoder

SummScore: A Comprehensive Evaluation Metric for Summary Quality Based on Cross-Encoder

URL: http://arxiv.org/abs/2207.04660v1
Date: Mon, 11 Jul 2022 06:47:29 GMT
Title: SummScore: A Comprehensive Evaluation Metric for Summary Quality Based on Cross-Encoder
Authors: Wuhang Lin, Shasha Li, Chen Zhang, Bin Ji, Jie Yu, Jun Ma, Zibo Yi
Abstract summary: SummScore is a comprehensive metric for summary quality evaluation based on CrossEncoder. To improve the comprehensiveness and interpretability, SummScore consists of four fine-grained submodels. Extensive experiments show that SummScore significantly outperforms existing evaluation metrics in the above four dimensions in correlation with human scoring.
Score: 12.913447457411317
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text summarization models are often trained to produce summaries that meet human quality requirements. However, the existing evaluation metrics for summary text are only rough proxies for summary quality, suffering from low correlation with human scoring and inhibition of summary diversity. To solve these problems, we propose SummScore, a comprehensive metric for summary quality evaluation based on CrossEncoder. Firstly, by adopting the original-summary measurement mode and comparing the semantics of the original text, SummScore gets rid of the inhibition of summary diversity. With the help of the text-matching pre-training Cross-Encoder, SummScore can effectively capture the subtle differences between the semantics of summaries. Secondly, to improve the comprehensiveness and interpretability, SummScore consists of four fine-grained submodels, which measure Coherence, Consistency, Fluency, and Relevance separately. We use semi-supervised multi-rounds of training to improve the performance of our model on extremely limited annotated data. Extensive experiments show that SummScore significantly outperforms existing evaluation metrics in the above four dimensions in correlation with human scoring. We also provide the quality evaluation results of SummScore on 16 mainstream summarization models for later research.

Related papers

Assessment of Transformer-Based Encoder-Decoder Model for Human-Like Summarization [0.05852077003870416]
This work leverages transformer-based BART model for human-like summarization. On training and fine-tuning the encoder-decoder model, it is tested with diverse sample articles. The finetuned model performance is compared with the baseline pretrained model. Empirical results on BBC News articles highlight that the gold standard summaries written by humans are more factually consistent by 17%.
arXiv Detail & Related papers (2024-10-22T09:25:04Z)
Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs [3.9627148816681284]
This article assesses which ChatGPT inputs produce better quality score estimates. The optimal input is the article title and abstract, with average ChatGPT scores based on these correlating at 0.67 with human scores.
arXiv Detail & Related papers (2024-08-13T09:19:21Z)
Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods. We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment. We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z)
Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal. Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions. We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary. We show that this simple combination of two metrics in the zero-shot achieves higher correlations than existing factuality metrics for document summarization. Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations. We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences. Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric. Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score. Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries. We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion. We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics. We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study to a weakly supervised summary evaluation approach without the presence of reference summaries. Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.
arXiv Detail & Related papers (2020-05-13T15:40:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.