BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of
Faithfulness Metrics
- URL: http://arxiv.org/abs/2212.09955v2
- Date: Mon, 5 Jun 2023 01:29:40 GMT
- Title: BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of
Faithfulness Metrics
- Authors: Liang Ma, Shuyang Cao, Robert L. Logan IV, Di Lu, Shihao Ran, Ke
Zhang, Joel Tetreault, Alejandro Jaimes
- Abstract summary: We present a benchmark of unfaithful minimal pairs (BUMP).
BUMP is a dataset of 889 human-written, minimally different summary pairs.
Unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics.
- Score: 70.52570641514146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of automatic faithfulness metrics for summarization has
produced a need for benchmarks to evaluate them. While existing benchmarks
measure the correlation with human judgements of faithfulness on
model-generated summaries, they are insufficient for diagnosing whether metrics
are: 1) consistent, i.e., indicate lower faithfulness as errors are introduced
into a summary, 2) effective on human-written texts, and 3) sensitive to
different error types (as summaries can contain multiple errors). To address
these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a
dataset of 889 human-written, minimally different summary pairs, where a single
error is introduced to a summary from the CNN/DailyMail dataset to produce an
unfaithful summary. We find BUMP complements existing benchmarks in a number of
ways: 1) the summaries in BUMP are harder to discriminate and less probable
under SOTA summarization models, 2) unlike non-pair-based datasets, BUMP can be
used to measure the consistency of metrics, and reveals that the most
discriminative metrics tend not to be the most consistent, and 3) unlike
datasets containing generated summaries with multiple errors, BUMP enables the
measurement of metrics' performance on individual error types.
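As a rough, illustrative sketch of how a pair-based benchmark such as BUMP can be used for meta-evaluation, the Python snippet below scores each faithful/unfaithful pair with an arbitrary faithfulness metric and reports how often the corrupted summary receives the lower score, overall and per error type. The score(document, summary) interface, the field names (document, faithful_summary, unfaithful_summary, error_type), and the error labels are assumptions made for illustration, not BUMP's actual schema or official evaluation code.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def evaluate_faithfulness_metric(
    pairs: List[Dict[str, str]],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Measure how often a metric ranks the faithful summary above its
    minimally edited, unfaithful counterpart (higher score = more faithful)."""
    n_consistent = 0
    per_error = defaultdict(list)  # error type -> list of pass/fail flags

    for pair in pairs:
        good = score(pair["document"], pair["faithful_summary"])
        bad = score(pair["document"], pair["unfaithful_summary"])
        ok = good > bad  # the corrupted summary should score strictly lower
        n_consistent += ok
        per_error[pair["error_type"]].append(ok)

    report = {"consistency": n_consistent / len(pairs)}
    for error_type, flags in per_error.items():
        report[f"consistency/{error_type}"] = sum(flags) / len(flags)
    return report


if __name__ == "__main__":
    # Toy lexical-overlap metric and a single toy pair, purely for illustration;
    # the error label is a placeholder, not BUMP's taxonomy.
    toy_metric = lambda doc, summ: sum(w in doc for w in summ.split()) / max(len(summ.split()), 1)
    toy_pairs = [{
        "document": "The mayor opened the new bridge on Monday.",
        "faithful_summary": "The mayor opened the bridge on Monday.",
        "unfaithful_summary": "The mayor opened the bridge on Friday.",
        "error_type": "entity",
    }]
    print(evaluate_faithfulness_metric(toy_pairs, toy_metric))
```

Because the two summaries in each pair differ only by the single introduced error, a drop in score can be attributed to that error, which is what makes the consistency measurement and the per-error-type breakdown meaningful.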
Related papers
- STORYSUMM: Evaluating Faithfulness in Story Summarization [31.94902013480574]
We introduce a new dataset, STORYSUMM, comprising short stories with localized faithfulness labels and error explanations.
This benchmark is intended for evaluating detection methods, testing whether a given method can detect challenging inconsistencies.
arXiv Detail & Related papers (2024-07-09T02:06:30Z) - Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors [11.07539342949602]
We propose an end-to-end framework for detecting factual errors in text summarization.
Our framework uses a diverse set of LLM prompts to identify factual inconsistencies.
We calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination.
arXiv Detail & Related papers (2024-06-18T18:59:37Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited
Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - Evaluating the Factual Consistency of Large Language Models Through News
Summarization [97.04685401448499]
We propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization.
For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent.
For factually inconsistent summaries, we generate summaries from a suite of summarization models and manually annotate them as factually inconsistent.
arXiv Detail & Related papers (2022-11-15T18:50:34Z) - Not All Errors are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate the limitations of token-level matching metrics.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - Understanding Factual Errors in Summarization: Errors, Summarizers,
Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare the performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.