Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and
Their Implications
- URL: http://arxiv.org/abs/2205.06828v1
- Date: Fri, 13 May 2022 18:00:11 GMT
- Title: Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and
Their Implications
- Authors: Kaitlyn Zhou, Su Lin Blodgett, Adam Trischler, Hal Daumé III, Kaheer Suleman, Alexandra Olteanu
- Abstract summary: This study surfaces the goals, community practices, assumptions, and constraints that shape NLG evaluations.
We examine their implications and how they embody ethical considerations.
- Score: 85.24952708195582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are many ways to express similar things in text, which makes evaluating
natural language generation (NLG) systems difficult. Compounding this
difficulty is the need to assess varying quality criteria depending on the
deployment setting. While the landscape of NLG evaluation has been well-mapped,
practitioners' goals, assumptions, and constraints -- which inform decisions
about what, when, and how to evaluate -- are often partially or implicitly
stated, or not stated at all. Combining a formative semi-structured interview
study of NLG practitioners (N=18) with a survey study of a broader sample of
practitioners (N=61), we surface goals, community practices, assumptions, and
constraints that shape NLG evaluations, examining their implications and how
they embody ethical considerations.
Related papers
- Large Language Models Are Active Critics in NLG Evaluation [9.932334723464129]
We introduce Active-Critic, a novel method for evaluating natural language generation (NLG) systems.
The protocol enables large language models (LLMs) to function as "active critics."
Experiments show that our approach achieves stronger alignment with human judgments than state-of-the-art evaluation methods (a minimal LLM-as-critic sketch follows below).
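The Active-Critic protocol itself is not detailed in this summary; the sketch below only illustrates the general LLM-as-critic idea of prompting a model to score one output on one criterion. The prompt wording and the call_llm() helper are assumptions, not the paper's actual interface.

    # Minimal sketch of LLM-as-critic scoring (illustrative only; the prompt
    # template and call_llm() helper are assumptions, not Active-Critic's API).

    def call_llm(prompt: str) -> str:
        """Placeholder for any chat/completions endpoint; returns the reply text."""
        raise NotImplementedError("wire this to an LLM provider")

    def critic_score(source: str, generated: str, criterion: str = "coherence") -> int:
        """Ask the model to rate one output on one criterion, on a 1-5 scale."""
        prompt = (
            "You are an evaluator of generated text.\n"
            f"Criterion: {criterion}\n"
            f"Source: {source}\n"
            f"Generated text: {generated}\n"
            "Rate the generated text from 1 (worst) to 5 (best). Reply with one integer."
        )
        reply = call_llm(prompt)
        digits = [c for c in reply if c.isdigit()]
        return int(digits[0]) if digits else 3  # fall back to the midpoint if unparsable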
arXiv Detail & Related papers (2024-10-14T17:04:41Z)
- Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability [39.12792986841385]
In this paper, we construct a large-scale NLG evaluation corpus, NLG-Eval, with annotations from both humans and GPT-4.
We also propose Themis, an LLM dedicated to NLG evaluation, trained with our designed multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
arXiv Detail & Related papers (2024-06-26T14:04:29Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance (a minimal meta-evaluation sketch follows below).
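Claims like the one above rest on meta-evaluation: correlating metric scores with human judgments on the same outputs. A minimal sketch with toy numbers (not data from the paper):

    # Meta-evaluation sketch: correlate metric scores with human judgments.
    # The scores below are toy data, not results reported in the paper.

    from scipy.stats import kendalltau, spearmanr

    human_scores  = [4.0, 2.5, 3.0, 5.0, 1.5, 4.5]        # e.g., averaged annotator ratings
    metric_scores = [0.71, 0.40, 0.55, 0.90, 0.30, 0.80]  # e.g., a reference-free metric

    rho, rho_p = spearmanr(human_scores, metric_scores)
    tau, tau_p = kendalltau(human_scores, metric_scores)
    print(f"Spearman rho={rho:.3f} (p={rho_p:.3f}); Kendall tau={tau:.3f} (p={tau_p:.3f})")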
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z)
- X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects [32.50977115108103]
We introduce X-Eval, a two-stage instruction tuning framework to evaluate text along both seen and unseen aspects customized by end users.
X-Eval consists of two learning stages: a vanilla instruction tuning stage that improves the model's ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality (a data-layout sketch follows below).
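X-Eval's actual data format is not given in this summary; the sketch below only suggests how instruction-tuning examples for the two stages might be laid out, with auxiliary-aspect judgments folded into the second stage. Field names and prompt wording are assumptions.

    # Sketch of two-stage instruction-tuning data in the spirit of X-Eval
    # (field names and prompt wording are assumptions, not the paper's format).

    def vanilla_example(text: str, aspect: str, rating: int) -> dict:
        """Stage 1: a plain evaluation instruction for a single aspect."""
        return {
            "instruction": f"Rate the {aspect} of the following text on a 1-5 scale.",
            "input": text,
            "output": str(rating),
        }

    def augmented_example(text: str, aspect: str, rating: int, auxiliary: dict) -> dict:
        """Stage 2: the same instruction, enriched with judgments on related aspects."""
        hints = "; ".join(f"{a}: {r}/5" for a, r in auxiliary.items())
        return {
            "instruction": (
                f"Given auxiliary judgments ({hints}), "
                f"rate the {aspect} of the following text on a 1-5 scale."
            ),
            "input": text,
            "output": str(rating),
        }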
arXiv Detail & Related papers (2023-11-15T09:01:55Z)
- Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose CoEval, a collaborative evaluation pipeline involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions, with their answers generated by PLMs, are then recomposed as evidence to obtain the evaluation result (a minimal sketch follows below).
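A minimal sketch of that decompose-then-recompose loop, assuming a placeholder answer_yes_no() model call and a generic per-sentence subquestion template; the paper's actual prompts and aggregation may differ.

    # Decomposed, instruction-style QA evaluation in the spirit of DecompEval.
    # The subquestion template and answer_yes_no() helper are assumptions.

    import re

    def answer_yes_no(question: str) -> bool:
        """Placeholder for a pretrained LM answering an instruction-style yes/no question."""
        raise NotImplementedError("wire this to a PLM")

    def decomp_score(context: str, generated: str,
                     criterion: str = "consistent with the context") -> float:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", generated.strip()) if s]
        answers = []
        for sent in sentences:
            subquestion = (
                f"Context: {context}\n"
                f"Sentence: {sent}\n"
                f"Is this sentence {criterion}? Answer yes or no."
            )
            answers.append(answer_yes_no(subquestion))
        # Recompose the per-sentence answers into a single quality score.
        return sum(answers) / len(answers) if answers else 0.0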
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text [23.119724118572538]
Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted.
This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG.
arXiv Detail & Related papers (2022-02-14T18:51:07Z)
- Evaluation of Text Generation: A Survey [107.62760642328455]
The paper surveys evaluation methods for natural language generation systems developed in the last few years.
We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics; a minimal sketch of an untrained metric follows this entry.
arXiv Detail & Related papers (2020-06-26T04:52:48Z)
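As a concrete instance of category (2) above, an automatic metric that requires no training can be as simple as unigram-overlap F1 against a reference. This is illustrative only, not a metric proposed in the surveyed work.

    # Category (2) illustration: an untrained automatic metric, unigram-overlap F1.

    from collections import Counter

    def unigram_f1(candidate: str, reference: str) -> float:
        cand = Counter(candidate.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum((cand & ref).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    print(unigram_f1("the cat sat on the mat", "a cat was sitting on the mat"))  # ~0.615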
This list is automatically generated from the titles and abstracts of the papers on this site.