Investigating Crowdsourcing Protocols for Evaluating the Factual
Consistency of Summaries
- URL: http://arxiv.org/abs/2109.09195v2
- Date: Tue, 21 Sep 2021 03:24:41 GMT
- Title: Investigating Crowdsourcing Protocols for Evaluating the Factual
Consistency of Summaries
- Authors: Xiangru Tang, Alexander R. Fabbri, Ziming Mao, Griffin Adams, Borui
Wang, Haoran Li, Yashar Mehdad, Dragomir Radev
- Abstract summary: Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
- Score: 59.27273928454995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current pre-trained models applied to summarization are prone to factual
inconsistencies which either misrepresent the source text or introduce
extraneous information. Thus, comparing the factual consistency of summaries is
necessary as we develop improved models. However, the optimal human evaluation
setup for factual consistency has not been standardized. To address this issue,
we crowdsourced evaluations for factual consistency using the rating-based
Likert scale and ranking-based Best-Worst Scaling protocols, on 100 articles
from each of the CNN-Daily Mail and XSum datasets over four state-of-the-art
models, to determine the most reliable evaluation framework. We find that
ranking-based protocols offer a more reliable measure of summary quality across
datasets, while the reliability of Likert ratings depends on the target dataset
and the evaluation design. Our crowdsourcing templates and summary evaluations
will be publicly available to facilitate future research on factual consistency
in summarization.
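As a rough illustration of how annotations from the two protocols might be aggregated into system-level scores, here is a minimal Python sketch; the data layout, the 1-5 Likert range, and the assumption that every system appears in every Best-Worst comparison are illustrative choices, not the paper's released templates or exact aggregation.
```python
from collections import defaultdict

# Hypothetical annotations; the field layout and the 1-5 rating scale are
# illustrative assumptions, not the paper's released format.
likert_annotations = [
    # (system, article_id, factual-consistency rating on a 1-5 scale)
    ("model_a", "cnndm-001", 4),
    ("model_a", "cnndm-002", 5),
    ("model_b", "cnndm-001", 3),
    ("model_b", "cnndm-002", 4),
]
bws_annotations = [
    # (article_id, system judged best, system judged worst) per comparison
    ("cnndm-001", "model_a", "model_b"),
    ("cnndm-002", "model_a", "model_b"),
]

def likert_scores(annotations):
    """Rating-based protocol: mean Likert rating per system."""
    totals, counts = defaultdict(float), defaultdict(int)
    for system, _, rating in annotations:
        totals[system] += rating
        counts[system] += 1
    return {s: totals[s] / counts[s] for s in totals}

def best_worst_scores(annotations, systems):
    """Ranking-based protocol: standard Best-Worst Scaling score,
    (#times best - #times worst) / #comparisons each system appears in."""
    best, worst = defaultdict(int), defaultdict(int)
    for _, b, w in annotations:
        best[b] += 1
        worst[w] += 1
    n = len(annotations)  # assumes every system appears in every comparison
    return {s: (best[s] - worst[s]) / n for s in systems}

print(likert_scores(likert_annotations))          # {'model_a': 4.5, 'model_b': 3.5}
print(best_worst_scores(bws_annotations, ["model_a", "model_b"]))  # {'model_a': 1.0, 'model_b': -1.0}
```
The Best-Worst score used here is the standard (#best - #worst) / #appearances formulation, which directly yields a system ranking comparable across articles.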
Related papers
- MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures [28.130008435669865]
We introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize evaluations across diverse input and output modalities.
We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions.
arXiv Detail & Related papers (2024-10-17T16:52:28Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
We further refine the robustness metric so that a model is judged robust only if its performance is consistently accurate across whole cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show that GREAT Score correlates highly with attack-based model rankings on RobustBench while costing significantly less to compute.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore, which combines an image-summary metric with a document-summary metric to leverage their respective robustness and strong factuality detection performance.
We show that this simple combination of two metrics, used zero-shot, achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
- Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency [14.974996886744083]
We release SummFC, a filtered summarization dataset with improved factual consistency.
We argue that our dataset should become a valid benchmark for developing and evaluating summarization systems.
arXiv Detail & Related papers (2022-10-31T15:04:20Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results (an NLI-based scoring sketch appears after this list).
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
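As a rough illustration of the NLI-based factual-consistency scoring mentioned in the TRUE entry above, here is a minimal sketch assuming an off-the-shelf MNLI classifier; the model choice and the use of the raw entailment probability as the consistency score are assumptions for illustration, not the benchmark's exact configuration.
```python
# Minimal sketch: score a summary's factual consistency as the probability
# that the source document entails it, using a publicly available MNLI model.
# The model name and scoring rule are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_probability(source: str, summary: str) -> float:
    """Probability that the source entails the summary; higher suggests
    better factual consistency."""
    inputs = tokenizer(source, summary, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()

print(entailment_probability(
    "The company reported a 10 percent rise in quarterly profit.",
    "Quarterly profit rose by ten percent.",
))
```
In practice, long source documents are typically split into passages and the per-passage entailment scores aggregated, since MNLI-style models truncate inputs at a fixed length.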