Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation
- URL: http://arxiv.org/abs/2212.07981v2
- Date: Tue, 6 Jun 2023 07:45:27 GMT
- Title: Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation
- Authors: Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan,
Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir
Radev
- Abstract summary: Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
- Score: 136.16507050034755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human evaluation is the foundation upon which the evaluation of both
summarization systems and automatic metrics rests. However, existing human
evaluation studies for summarization either exhibit a low inter-annotator
agreement or have insufficient scale, and an in-depth analysis of human
evaluation is lacking. Therefore, we address the shortcomings of existing
summarization evaluation along the following axes: (1) We propose a modified
summarization salience protocol, Atomic Content Units (ACUs), which is based on
fine-grained semantic units and allows for a high inter-annotator agreement.
(2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large
human evaluation dataset consisting of 22,000 summary-level annotations over 28
top-performing systems on three datasets. (3) We conduct a comparative study of
four human evaluation protocols, underscoring potential confounding factors in
evaluation setups. (4) We evaluate 50 automatic metrics and their variants
using the collected human annotations across evaluation protocols and
demonstrate how our benchmark leads to more statistically stable and
significant results. The metrics we benchmarked include recent methods based on
large language models (LLMs), GPTScore and G-Eval. Furthermore, our findings
have important implications for evaluating LLMs, as we show that LLMs adjusted
by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation,
which is affected by the annotators' prior, input-agnostic preferences, calling
for more robust, targeted evaluation methods.
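As a rough illustration of the ACU protocol in point (1): annotators write fine-grained Atomic Content Units from the reference summaries and then give binary judgments on whether each ACU is supported by a system summary. The sketch below aggregates such judgments into a recall-style summary-level score; it is a minimal sketch under that simplified reading (the paper also defines a length-normalized variant not reproduced here), and the function and variable names are illustrative, not taken from the RoSE codebase.

```python
from typing import List

def acu_score(matched: List[bool]) -> float:
    """Summary-level ACU score: the fraction of reference ACUs judged
    to be supported by the system summary (recall-style aggregation)."""
    if not matched:
        return 0.0
    return sum(matched) / len(matched)

# Example: 3 of the 4 ACUs written from the reference are supported.
judgments = [True, True, False, True]
print(acu_score(judgments))  # 0.75
```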
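Point (4) scores automatic metrics by their agreement with these human annotations. A common meta-evaluation recipe (a hedged sketch of the general approach, not the paper's exact evaluation code) is to compute a summary-level Kendall correlation between metric scores and human ACU scores, then bootstrap over summaries to gauge how statistically stable the result is; the helper below assumes numpy and scipy are available, and the scores shown are hypothetical.

```python
import numpy as np
from scipy.stats import kendalltau

def summary_level_kendall(metric_scores, human_scores, n_boot=1000, seed=0):
    """Kendall's tau between an automatic metric and human ACU scores,
    with a bootstrap confidence interval over summaries."""
    metric_scores = np.asarray(metric_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    tau, _ = kendalltau(metric_scores, human_scores)

    rng = np.random.default_rng(seed)
    n = len(metric_scores)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample summaries with replacement
        b, _ = kendalltau(metric_scores[idx], human_scores[idx])
        if not np.isnan(b):  # skip degenerate (constant) resamples
            boots.append(b)
    low, high = np.percentile(boots, [2.5, 97.5])
    return tau, (low, high)

# Toy usage with hypothetical scores for five summaries.
tau, ci = summary_level_kendall([0.2, 0.5, 0.4, 0.8, 0.7],
                                [0.25, 0.4, 0.5, 0.9, 0.6])
print(tau, ci)
```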
Related papers
- ReIFE: Re-evaluating Instruction-Following Evaluation [105.75525154888655]
We present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 proposed evaluation protocols.
Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness.
arXiv Detail & Related papers (2024-10-09T17:14:50Z) - Poor-Supervised Evaluation for SuperLLM via Mutual Consistency [20.138831477848615]
We propose the PoEM framework to conduct evaluation without accurate labels.
We first prove that the capability of a model can be equivalently assessed by its consistency with a certain reference model.
Since these conditions may not fully hold in practice, we introduce an algorithm that treats both humans (when available) and the models under evaluation as reference models.
arXiv Detail & Related papers (2024-08-25T06:49:03Z) - Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models [92.66784679667441]
Prometheus 2 is a more powerful evaluator LM that closely mirrors human and GPT-4 judgements.
It is capable of processing both direct assessment and pairwise ranking formats, paired with user-defined evaluation criteria.
On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges.
arXiv Detail & Related papers (2024-05-02T17:59:35Z) - Learning and Evaluating Human Preferences for Conversational Head
Generation [101.89332968344102]
We propose a novel learning-based evaluation metric, Preference Score (PS), which fits human preferences based on quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z) - Human-like Summarization Evaluation with ChatGPT [38.39767193442397]
ChatGPT was able to complete annotations relatively smoothly using Likert scale scoring, pairwise comparison, Pyramid, and binary factuality evaluation.
It outperformed commonly used automatic evaluation metrics on some datasets.
arXiv Detail & Related papers (2023-04-05T16:17:32Z) - Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.