Automated Metrics for Medical Multi-Document Summarization Disagree with
Human Evaluations
- URL: http://arxiv.org/abs/2305.13693v1
- Date: Tue, 23 May 2023 05:00:59 GMT
- Title: Automated Metrics for Medical Multi-Document Summarization Disagree with
Human Evaluations
- Authors: Lucy Lu Wang, Yulia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey
E. Kuehl, Erin Bransom, Byron C. Wallace
- Abstract summary: We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries.
We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, but in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.
- Score: 22.563596069176047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating multi-document summarization (MDS) quality is difficult. This is
especially true in the case of MDS for biomedical literature reviews, where
models must synthesize contradicting evidence reported across different
documents. Prior work has shown that rather than performing the task, models
may exploit shortcuts that are difficult to detect using standard n-gram
similarity metrics such as ROUGE. Better automated evaluation metrics are
needed, but few resources exist to assess metrics when they are proposed.
Therefore, we introduce a dataset of human-assessed summary quality facets and
pairwise preferences to encourage and support the development of better
automated evaluation methods for literature review MDS. We take advantage of
community submissions to the Multi-document Summarization for Literature Review
(MSLR) shared task to compile a diverse and representative sample of generated
summaries. We analyze how automated summarization evaluation metrics correlate
with lexical features of generated summaries, with other automated metrics
(including several we propose in this work), and with aspects of human-assessed
summary quality. We find that not only do automated metrics fail to capture
aspects of quality as assessed by humans, but in many cases the system rankings
produced by these metrics are anti-correlated with rankings according to human
annotators.
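The core of this analysis is comparing system-level rankings induced by automated metrics against rankings induced by human judgments. The following is a minimal sketch (not the authors' code; all scores are hypothetical) of how such a system-level rank correlation can be computed with Kendall's tau, where a negative value corresponds to the anti-correlation described above. In the paper's setting, the per-summary metric scores would come from metrics such as ROUGE over MSLR shared-task submissions, and the human scores from the released facet annotations and pairwise preferences.

```python
# Minimal sketch: system-level rank correlation between an automated
# metric and human quality judgments. All numbers below are hypothetical.
import numpy as np
from scipy.stats import kendalltau

# Per-summary scores: rows = systems, columns = evaluated summaries.
metric_scores = np.array([
    [0.42, 0.38, 0.45],   # system A (e.g., ROUGE-L per summary)
    [0.35, 0.33, 0.40],   # system B
    [0.48, 0.44, 0.50],   # system C
])
human_scores = np.array([
    [3.0, 2.5, 3.5],      # system A (e.g., averaged human facet ratings)
    [4.0, 4.5, 4.0],      # system B
    [2.0, 2.5, 3.0],      # system C
])

# System-level scores: average over all evaluated summaries per system.
metric_by_system = metric_scores.mean(axis=1)
human_by_system = human_scores.mean(axis=1)

# Kendall's tau over the system-level scores; a negative tau indicates
# that the metric's system ranking is anti-correlated with the human one.
tau, p_value = kendalltau(metric_by_system, human_by_system)
print(f"system-level Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```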
Related papers
- A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLM-based evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z) - PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- Knowledge-Centric Templatic Views of Documents [2.654058995940072]
Authors often share their ideas in various document formats, such as slide decks, newsletters, reports, and posters.
We introduce a novel unified evaluation framework that can be adapted to measuring the quality of document generators.
We conduct a human evaluation, which shows that people prefer 82% of the documents generated with our method.
arXiv Detail & Related papers (2024-01-13T01:22:15Z)
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation [19.116396693370422]
We propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics framework.
We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs.
We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers.
arXiv Detail & Related papers (2021-04-29T16:42:09Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)