Is Summary Useful or Not? An Extrinsic Human Evaluation of Text
Summaries on Downstream Tasks
- URL: http://arxiv.org/abs/2305.15044v1
- Date: Wed, 24 May 2023 11:34:39 GMT
- Title: Is Summary Useful or Not? An Extrinsic Human Evaluation of Text
Summaries on Downstream Tasks
- Authors: Xiao Pu, Mingqi Gao, Xiaojun Wan
- Abstract summary: This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
- Score: 45.550554287918885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research on automated text summarization relies heavily on human and
automatic evaluation. While recent work on human evaluation mainly adopted
intrinsic evaluation methods, judging the generic quality of text summaries,
e.g. informativeness and coherence, our work focuses on evaluating the
usefulness of text summaries with extrinsic methods. We carefully design three
different downstream tasks for extrinsic human evaluation of summaries, i.e.,
question answering, text classification and text similarity assessment. We
carry out experiments using system rankings and user behavior data to evaluate
the performance of different summarization models. We find summaries are
particularly useful in tasks that rely on an overall judgment of the text,
while being less effective for question answering tasks. The results show that
summaries generated by fine-tuned models lead to higher consistency in
usefulness across all three tasks, as rankings of fine-tuned summarization
systems are close across downstream tasks according to the proposed extrinsic
metrics. Summaries generated by models in the zero-shot setting, however, are
found to be biased towards the text classification and similarity assessment
tasks, due to their more general and less detailed summary style. We further evaluate
the correlation of 14 intrinsic automatic metrics with human criteria and show
that intrinsic automatic metrics perform well in evaluating the usefulness of
summaries in the question-answering task, but are less effective in the other
two tasks. This highlights the limitations of relying solely on intrinsic
automatic metrics in evaluating the performance and usefulness of summaries.
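As a rough illustration of the metric-correlation analysis described above, the sketch below shows one way to compute a system-level Spearman correlation between an intrinsic automatic metric (ROUGE-L) and an extrinsic usefulness signal such as downstream QA accuracy. The system names, summaries, and scores are invented placeholders rather than data from the paper, and the snippet is only a minimal approximation of the evaluation setup, not the authors' pipeline.
```python
# Minimal sketch: correlate an intrinsic metric (ROUGE-L F1) with an
# extrinsic usefulness signal (per-system QA accuracy). All data below
# are hypothetical placeholders for illustration only.
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Toy per-system data: a generated summary, its reference, and a
# downstream QA accuracy obtained from human annotators.
systems = {
    "bart-finetuned": {
        "summary": "The council approved the new transit budget on Monday.",
        "reference": "The city council passed the transit budget at Monday's meeting.",
        "qa_accuracy": 0.72,
    },
    "gpt-zero-shot": {
        "summary": "A local government made a financial decision recently.",
        "reference": "The city council passed the transit budget at Monday's meeting.",
        "qa_accuracy": 0.55,
    },
    "pegasus-finetuned": {
        "summary": "Council passes transit budget after Monday vote.",
        "reference": "The city council passed the transit budget at Monday's meeting.",
        "qa_accuracy": 0.68,
    },
}

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

intrinsic_scores, extrinsic_scores = [], []
for name, data in systems.items():
    # Intrinsic score: ROUGE-L F1 of the system summary against the reference.
    rouge_l = scorer.score(data["reference"], data["summary"])["rougeL"].fmeasure
    intrinsic_scores.append(rouge_l)
    extrinsic_scores.append(data["qa_accuracy"])

# System-level rank correlation between the intrinsic metric and extrinsic
# usefulness; the paper reports this kind of analysis for 14 intrinsic
# metrics across three downstream tasks.
rho, p_value = spearmanr(intrinsic_scores, extrinsic_scores)
print(f"Spearman correlation: {rho:.3f} (p={p_value:.3f})")
```
In the paper's setting, this correlation is computed per downstream task, which is what exposes the gap between the question-answering task and the classification and similarity tasks.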
Related papers
- What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334]
Evaluating a story can be more challenging than other generation evaluation tasks.
We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual.
We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
arXiv Detail & Related papers (2024-08-26T20:35:42Z)
- OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
arXiv Detail & Related papers (2023-10-27T13:09:54Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z)
- SummScore: A Comprehensive Evaluation Metric for Summary Quality Based on Cross-Encoder [12.913447457411317]
SummScore is a comprehensive metric for summary quality evaluation based on a cross-encoder.
To improve the comprehensiveness and interpretability, SummScore consists of four fine-grained submodels.
Extensive experiments show that SummScore significantly outperforms existing evaluation metrics in the above four dimensions in correlation with human scoring.
arXiv Detail & Related papers (2022-07-11T06:47:29Z)
- Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation [42.63902468258758]
We propose a novel metric to evaluate the factual consistency in text summarization via counterfactual estimation.
We conduct a series of experiments on three public abstractive text summarization datasets.
arXiv Detail & Related papers (2021-08-30T11:48:41Z)
- Metrics also Disagree in the Low Scoring Range: Revisiting Summarization Evaluation Metrics [20.105119107290488]
One exemplar work concludes that automatic metrics strongly disagree when ranking high-scoring summaries.
We find that their observations stem from the fact that metrics disagree in ranking summaries from any narrow scoring range.
Apart from the width of the scoring range of summaries, we analyze three other properties that impact inter-metric agreement.
arXiv Detail & Related papers (2020-11-08T22:26:06Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate the summary qualities without reference summaries by unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)