Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
- URL: http://arxiv.org/abs/2202.06935v1
- Date: Mon, 14 Feb 2022 18:51:07 GMT
- Title: Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
- Authors: Sebastian Gehrmann, Elizabeth Clark, Thibault Sellam
- Abstract summary: Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted.
This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG.
- Score: 23.119724118572538
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Evaluation practices in natural language generation (NLG) have many known
flaws, but improved evaluation approaches are rarely widely adopted. This issue
has become more urgent, since neural NLG models have improved to the point
where they can often no longer be distinguished based on the surface-level
features that older metrics rely on. This paper surveys the issues with human
and automatic model evaluations and with commonly used datasets in NLG that
have been pointed out over the past 20 years. We summarize, categorize, and
discuss how researchers have been addressing these issues and what their
findings mean for the current state of model evaluations. Building on those
insights, we lay out a long-term vision for NLG evaluation and propose concrete
steps for researchers to improve their evaluation processes. Finally, we
analyze how well 66 NLG papers from recent NLP conferences already follow
these suggestions and identify which areas require more drastic changes to
the status quo.
Related papers
- Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability [39.12792986841385]
In this paper, we construct NLG-Eval, a large-scale NLG evaluation corpus with annotations from both humans and GPT-4.
We also propose Themis, an LLM dedicated to NLG evaluation, trained with our multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis exhibits superior evaluation performance on various NLG tasks while generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
arXiv Detail & Related papers (2024-06-26T14:04:29Z) - Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models [52.368110271614285]
We introduce AdvEval, a novel black-box adversarial framework against NLG evaluators.
AdvEval is specially tailored to generate data that yield strong disagreements between human and victim evaluators.
We conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks including dialogue, summarization, and question evaluation.
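To make the idea concrete, here is a minimal sketch of such a black-box adversarial loop: repeatedly perturb a deliberately flawed text and keep the variant that the victim evaluator scores highest, yielding candidates on which automatic and human judgments should diverge. The function names, the scoring interface, and the toy stand-ins are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a black-box adversarial loop against an NLG evaluator
# (illustrative only; the interfaces below are assumptions, not AdvEval's code).
from typing import Callable, List, Tuple

def adversarial_search(
    seed_text: str,                       # deliberately flawed candidate (known low human quality)
    perturb: Callable[[str], List[str]],  # e.g., an LLM prompted to rephrase without fixing the errors
    victim_score: Callable[[str], float], # the evaluator under attack, treated as a black box
    rounds: int = 5,
) -> Tuple[str, float]:
    """Search for a variant that the victim evaluator rates highly even though
    the text is low quality by construction, i.e., a strong human/metric disagreement."""
    best_text, best_score = seed_text, victim_score(seed_text)
    for _ in range(rounds):
        for candidate in perturb(best_text):
            score = victim_score(candidate)
            if score > best_score:          # keep the variant that fools the metric most
                best_text, best_score = candidate, score
    return best_text, best_score

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model access.
    def toy_perturb(text: str) -> List[str]:
        return [text + " Indeed.", "Certainly, " + text]

    def toy_victim(text: str) -> float:
        return len(text) / 100.0            # a deliberately naive, length-biased "metric"

    adv, score = adversarial_search("The report says sales fell, which is wrong.", toy_perturb, toy_victim)
    print(adv, score)
```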
arXiv Detail & Related papers (2024-05-23T14:48:15Z) - LLM-based NLG Evaluation: Current Status and Challenges [41.69249290537395]
Evaluating natural language generation (NLG) is a vital but challenging problem in artificial intelligence.
Large language models (LLMs) have demonstrated great potential in NLG evaluation in recent years.
Various automatic evaluation methods based on LLMs have been proposed.
arXiv Detail & Related papers (2024-02-02T13:06:35Z) - Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z) - Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets [95.4182455942628]
We propose Near-Negative Distinction (NND), which repurposes prior human annotations into NND tests.
In an NND test, an NLG model must place higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error.
We show that NND achieves higher correlation with human judgments than standard NLG evaluation metrics.
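As a rough illustration of the NND criterion described above, the sketch below checks whether a model assigns higher likelihood to a high-quality candidate than to a near-negative one, and aggregates the pass rate over a set of repurposed annotations. The scoring interface, function names, and toy scorer are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of an NND-style check (illustrative; interfaces are assumptions).
from typing import Callable, Iterable, Tuple

def nnd_pass(
    source: str,
    high_quality: str,
    near_negative: str,
    log_likelihood: Callable[[str, str], float],  # log p(candidate | source) under the NLG model
) -> bool:
    """One NND test passes if the model assigns higher likelihood to the
    high-quality candidate than to a near-negative candidate with a known error."""
    return log_likelihood(source, high_quality) > log_likelihood(source, near_negative)

def nnd_accuracy(
    tests: Iterable[Tuple[str, str, str]],        # (source, high-quality, near-negative) triples
    log_likelihood: Callable[[str, str], float],
) -> float:
    """Fraction of tests the model passes; this score can be compared against human judgments."""
    results = [nnd_pass(src, good, bad, log_likelihood) for src, good, bad in tests]
    return sum(results) / len(results)

if __name__ == "__main__":
    # Toy length-based scorer standing in for a real model's log-likelihood.
    def toy_ll(source: str, candidate: str) -> float:
        return -abs(len(candidate) - len(source))

    tests = [("the cat sat on the mat", "a cat sat on a mat", "cat")]
    print(nnd_accuracy(tests, toy_ll))
```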
arXiv Detail & Related papers (2022-05-13T20:02:53Z) - Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications [85.24952708195582]
This study examines the goals, community practices, assumptions, and constraints that shape NLG evaluations.
We examine their implications and how they embody ethical considerations.
arXiv Detail & Related papers (2022-05-13T18:00:11Z) - Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark [54.782397511033345]
Crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts.
We present a new unified cross-dataset benchmark with 14 NLI datasets and re-evaluate 9 widely used neural network-based NLI models.
Our proposed evaluation scheme and experimental baselines could provide a basis for future research on reliable NLI.
arXiv Detail & Related papers (2020-10-15T11:50:12Z) - A Survey of Evaluation Metrics Used for NLG Systems [19.20118684502313]
The success of Deep Learning has created a surge of interest in a wide range of Natural Language Generation (NLG) tasks.
Unlike classification tasks, automatically evaluating NLG systems is itself a major challenge.
The expanding number of NLG models and the shortcomings of current metrics have led to a rapid surge in the number of evaluation metrics proposed since 2014.
arXiv Detail & Related papers (2020-08-27T09:25:05Z) - Evaluation of Text Generation: A Survey [107.62760642328455]
The paper surveys natural language generation evaluation methods that have been developed in the last few years.
We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics.
arXiv Detail & Related papers (2020-06-26T04:52:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.