Finding a Balanced Degree of Automation for Summary Evaluation
- URL: http://arxiv.org/abs/2109.11503v1
- Date: Thu, 23 Sep 2021 17:12:35 GMT
- Title: Finding a Balanced Degree of Automation for Summary Evaluation
- Authors: Shiyue Zhang, Mohit Bansal
- Abstract summary: We propose flexible semi-automatic to automatic summary evaluation metrics.
Semi-automatic Lite2Pyramid retains the reusable human-labeled Summary Content Units (SCUs) for reference(s).
Fully automatic Lite3Pyramid further substitutes SCUs with automatically extracted Semantic Triplet Units (STUs).
- Score: 83.08810773093882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human evaluation for summarization tasks is reliable but brings in issues of
reproducibility and high costs. Automatic metrics are cheap and reproducible
but sometimes poorly correlated with human judgment. In this work, we propose
flexible semi-automatic to automatic summary evaluation metrics, following the
Pyramid human evaluation method. Semi-automatic Lite2Pyramid retains the
reusable human-labeled Summary Content Units (SCUs) for reference(s) but
replaces the manual work of judging SCUs' presence in system summaries with a
natural language inference (NLI) model. Fully automatic Lite3Pyramid further
substitutes SCUs with automatically extracted Semantic Triplet Units (STUs) via
a semantic role labeling (SRL) model. Finally, we propose in-between metrics,
Lite2.xPyramid, where we use a simple regressor to predict how well the STUs
can simulate SCUs and retain SCUs that are more difficult to simulate, which
provides a smooth transition and balance between automation and manual
evaluation. We compare against 15 existing metrics by evaluating human-metric
correlations on 3 existing meta-evaluation datasets and on our newly collected
PyrXSum (with 100/10 XSum examples/systems). The results show that Lite2Pyramid
consistently has the best summary-level correlations; Lite3Pyramid works better
than or comparably to other automatic metrics; and Lite2.xPyramid trades off small
correlation drops for larger reductions in manual effort, which can reduce costs
for future data collection. Our code and data are publicly available at:
https://github.com/ZhangShiyue/Lite2-3Pyramid
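
To make the Lite2Pyramid idea concrete, here is a minimal sketch of NLI-based SCU presence scoring: each human-labeled SCU is treated as a hypothesis against the system summary as premise, and the score is the fraction of SCUs judged entailed. The roberta-large-mnli checkpoint, the 0.5 threshold, and the unweighted average are stand-in assumptions for illustration; the released code linked above uses the authors' own fine-tuned NLI model and scoring details.

```python
# Sketch of Lite2Pyramid-style scoring: an NLI model judges whether each
# human-written SCU is "present" (entailed) in the system summary.
# roberta-large-mnli is a stand-in; the paper uses its own fine-tuned NLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumption: any NLI checkpoint with an entailment label works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
ENTAILMENT_ID = model.config.label2id.get("ENTAILMENT", 2)

def lite2pyramid_score(system_summary: str, scus: list[str], threshold: float = 0.5) -> float:
    """Fraction of SCUs the NLI model judges as present in the system summary."""
    if not scus:
        return 0.0
    present = 0
    for scu in scus:
        # premise = system summary, hypothesis = SCU
        inputs = tokenizer(system_summary, scu, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[0]
        if probs[ENTAILMENT_ID].item() >= threshold:
            present += 1
    return present / len(scus)

# Toy usage (not from the paper's datasets):
scus = ["The company reported record profits.", "The CEO resigned in March."]
print(lite2pyramid_score("Record profits were reported; the CEO stepped down in March.", scus))
```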
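
The Lite2.xPyramid trade-off can be illustrated in the same spirit: a regressor predicts how well each automatically extracted STU can simulate its SCU, and human SCUs are retained only for the hardest-to-simulate fraction. The `Unit` dataclass, its field names, and the `manual_ratio` knob below are hypothetical; only the "retain SCUs that are more difficult to simulate" logic comes from the abstract.

```python
# Illustrative sketch of mixing SCUs and STUs, as in the Lite2.xPyramid idea.
from dataclasses import dataclass

@dataclass
class Unit:
    scu: str                # human-labeled Summary Content Unit
    stu: str                # automatically extracted Semantic Triplet Unit
    simulatability: float   # regressor's prediction of how well the STU mimics the SCU

def mix_units(units: list[Unit], manual_ratio: float = 0.5) -> list[str]:
    """Keep human SCUs for the hardest-to-simulate fraction; use STUs elsewhere."""
    ranked = sorted(units, key=lambda u: u.simulatability)  # hardest to simulate first
    n_manual = round(manual_ratio * len(units))
    keep_scu = {id(u) for u in ranked[:n_manual]}
    return [u.scu if id(u) in keep_scu else u.stu for u in units]
```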
Related papers
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- On the Role of Summary Content Units in Text Summarization Evaluation [39.054511238166796]
We show two novel strategies to approximate written summary content units (SCUs).
We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs.
We also show through a simple sentence-decomposition baseline (SSUs) that SCUs offer the most value when ranking short summaries, but may not help as much when ranking systems or longer summaries.
arXiv Detail & Related papers (2024-04-02T07:09:44Z)
- OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
arXiv Detail & Related papers (2023-10-27T13:09:54Z)
- Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations [22.563596069176047]
We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries.
We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, but in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.
arXiv Detail & Related papers (2023-05-23T05:00:59Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation [160.07938471250048]
Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics.
We develop strong-performing automatic metrics for reference-based summarization evaluation.
arXiv Detail & Related papers (2023-03-07T02:49:50Z)
- How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)