On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
- URL: http://arxiv.org/abs/2212.10020v3
- Date: Thu, 18 May 2023 20:46:57 GMT
- Title: On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
- Authors: Tianxing He, Jingyu Zhang, Tianle Wang, Sachin Kumar, Kyunghyun Cho,
James Glass, Yulia Tsvetkov
- Abstract summary: We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
- Score: 79.01422521024834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we explore a useful but often neglected methodology for
robustness analysis of text generation evaluation metrics: stress tests with
synthetic data. Basically, we design and synthesize a wide range of potential
errors and check whether they result in a commensurate drop in the metric
scores. We examine a range of recently proposed evaluation metrics based on
pretrained language models, for the tasks of open-ended generation,
translation, and summarization. Our experiments reveal interesting
insensitivities, biases, or even loopholes in existing metrics. For example, we
find that BERTScore is confused by truncation errors in summarization, and
MAUVE (built on top of GPT-2) is insensitive to errors at the beginning or
middle of generations. Further, we investigate the reasons behind these blind
spots and suggest practical workarounds for a more reliable evaluation of text
generation. We have released our code and data at
https://github.com/cloudygoose/blindspot_nlg.
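As a rough illustration of the stress-test methodology described above, the sketch below synthesizes a single error type (truncation) and checks whether BERTScore drops accordingly. It is a minimal example assuming the publicly available `bert_score` package, not the paper's released evaluation code; the example sentences are made up.
```python
# Minimal sketch of a synthetic stress test (assumes `pip install bert-score`);
# illustrative only, not the authors' released code.
from bert_score import score

references = ["The committee approved the new budget after a lengthy debate on Tuesday."]
clean      = ["The committee approved the new budget after a long debate on Tuesday."]
truncated  = ["The committee approved the new"]  # synthetic truncation error

_, _, f_clean = score(clean, references, lang="en", verbose=False)
_, _, f_trunc = score(truncated, references, lang="en", verbose=False)

# A robust metric should assign a clearly lower score to the truncated output.
print(f"clean F1:     {f_clean.mean().item():.4f}")
print(f"truncated F1: {f_trunc.mean().item():.4f}")
```
The same template extends to other synthetic error types (for example, corrupting spans at the beginning or middle of generations) and to other metrics such as MAUVE.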
Related papers
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences; the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- SUT: Active Defects Probing for Transcompiler Models [24.01532199512389]
We introduce new metrics for programming language translation that address basic syntax errors.
Experiments have shown that even powerful models like ChatGPT still make mistakes on these basic unit tests.
arXiv Detail & Related papers (2023-10-22T07:16:02Z)
- TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks [44.801746603656504]
We present TIGERScore, a metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation.
Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct.
arXiv Detail & Related papers (2023-10-01T18:01:51Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between sentence pairs on held-out datasets from the 7 NLP tasks align well with human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate the limitations of existing evaluation metrics.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
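The sentence-level soft-matching idea can be sketched as follows; this is a minimal illustration assuming the `sentence-transformers` package and an arbitrary encoder, not SMART's actual matching function.
```python
# Minimal illustration of sentence-level soft matching (not SMART's exact formulation);
# assumes `pip install sentence-transformers` and an arbitrarily chosen encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

def soft_sentence_f1(candidate_sents, reference_sents):
    cand_emb = model.encode(candidate_sents, convert_to_tensor=True)
    ref_emb = model.encode(reference_sents, convert_to_tensor=True)
    sim = util.cos_sim(cand_emb, ref_emb)            # pairwise sentence similarities
    precision = sim.max(dim=1).values.mean().item()  # best reference match per candidate sentence
    recall = sim.max(dim=0).values.mean().item()     # best candidate match per reference sentence
    return 2 * precision * recall / (precision + recall)

print(soft_sentence_f1(
    ["The cat sat on the mat.", "It looked very content."],
    ["A cat was sitting on a mat.", "The animal seemed happy."],
))
```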
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation [92.42032403795879]
UNION is a learnable unreferenced metric for evaluating open-ended story generation.
It is trained to distinguish human-written stories from negative samples and recover the perturbation in negative stories.
Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories.
arXiv Detail & Related papers (2020-09-16T11:01:46Z)