Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation
- URL: http://arxiv.org/abs/2212.07981v2
- Date: Tue, 6 Jun 2023 07:45:27 GMT
- Title: Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation
- Authors: Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan,
Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir
Radev
- Abstract summary: Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
- Score: 136.16507050034755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human evaluation is the foundation upon which the evaluation of both
summarization systems and automatic metrics rests. However, existing human
evaluation studies for summarization either exhibit a low inter-annotator
agreement or have insufficient scale, and an in-depth analysis of human
evaluation is lacking. Therefore, we address the shortcomings of existing
summarization evaluation along the following axes: (1) We propose a modified
summarization salience protocol, Atomic Content Units (ACUs), which is based on
fine-grained semantic units and allows for a high inter-annotator agreement.
(2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large
human evaluation dataset consisting of 22,000 summary-level annotations over 28
top-performing systems on three datasets. (3) We conduct a comparative study of
four human evaluation protocols, underscoring potential confounding factors in
evaluation setups. (4) We evaluate 50 automatic metrics and their variants
using the collected human annotations across evaluation protocols and
demonstrate how our benchmark leads to more statistically stable and
significant results. The metrics we benchmarked include recent methods based on
large language models (LLMs), GPTScore and G-Eval. Furthermore, our findings
have important implications for evaluating LLMs, as we show that LLMs adjusted
by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation,
which is affected by the annotators' prior, input-agnostic preferences, calling
for more robust, targeted evaluation methods.
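The ACU protocol described in point (1) can be sketched in a few lines. This is a minimal illustration of the scoring idea only — annotators mark, for each fine-grained Atomic Content Unit extracted from the reference, whether the candidate summary contains it, and the summary-level score is the fraction of matched ACUs. The function names and the system-level aggregation step are assumptions, not the paper's exact implementation.

```python
# Minimal sketch of ACU-based scoring (illustrative; not the paper's
# exact implementation or normalization).

def acu_score(acu_matches: list[bool]) -> float:
    """Summary-level score: fraction of reference ACUs that annotators
    judged to be present in the candidate summary."""
    if not acu_matches:
        return 0.0
    return sum(acu_matches) / len(acu_matches)

def system_score(per_summary_matches: list[list[bool]]) -> float:
    """Hypothetical system-level aggregation: mean of summary-level scores."""
    scores = [acu_score(m) for m in per_summary_matches]
    return sum(scores) / len(scores)
```

For example, a summary matching 3 of 4 reference ACUs scores `acu_score([True, True, False, True])` → 0.75. Because each judgment is a binary decision about one small semantic unit rather than a holistic rating, this style of annotation is what allows the high inter-annotator agreement the abstract reports.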
Related papers
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models [92.66784679667441]
Prometheus 2 is a more powerful evaluator LM that closely mirrors human and GPT-4 judgements.
It can process both direct assessment and pairwise ranking formats, together with user-defined evaluation criteria.
On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges.
arXiv Detail & Related papers (2024-05-02T17:59:35Z)
- One Prompt To Rule Them All: LLMs for Opinion Summary Evaluation [30.674896082482476]
We show that Op-I-Prompt emerges as a good alternative for evaluating opinion summaries, achieving an average Spearman correlation of 0.70 with human judgments.
To the best of our knowledge, we are the first to investigate LLMs as evaluators on both closed-source and open-source models in the opinion summarization domain.
arXiv Detail & Related papers (2024-02-18T19:13:52Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- Human-like Summarization Evaluation with ChatGPT [38.39767193442397]
ChatGPT was able to complete annotations relatively smoothly using Likert scale scoring, pairwise comparison, Pyramid, and binary factuality evaluation.
It outperformed commonly used automatic evaluation metrics on some datasets.
arXiv Detail & Related papers (2023-04-05T16:17:32Z)
- KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.