Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets
- URL: http://arxiv.org/abs/2205.06871v1
- Date: Fri, 13 May 2022 20:02:53 GMT
- Title: Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets
- Authors: Philippe Laban and Chien-Sheng Wu and Wenhao Liu and Caiming Xiong
- Abstract summary: We propose Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests.
In an NND test, an NLG model must place higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error.
We show that NND achieves higher correlation with human judgments than standard NLG evaluation metrics.
- Score: 95.4182455942628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Precisely assessing progress in natural language generation (NLG) tasks
is challenging, and human evaluation to establish preference for one model's
output over another's is often necessary. However, human evaluation is usually
costly, difficult to reproduce, and non-reusable. In this paper, we propose a
new and simple automatic evaluation method for NLG called Near-Negative
Distinction (NND) that repurposes prior human annotations into NND tests. In an
NND test, an NLG model must place higher likelihood on a high-quality output
candidate than on a near-negative candidate with a known error. Model
performance is established by the number of NND tests a model passes, as well
as the distribution over task-specific errors the model fails on. Through
experiments on three NLG tasks (question generation, question answering, and
summarization), we show that NND achieves higher correlation with human
judgments than standard NLG evaluation metrics. We then illustrate NND
evaluation in four practical scenarios, such as performing fine-grained model
analysis or studying model training dynamics. Our findings suggest NND can
give a second life to human annotations and provide low-cost NLG evaluation.
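To make the NND procedure concrete, here is a minimal sketch in Python. It assumes candidate log-likelihoods (e.g., length-normalized sequence log-probabilities) have already been obtained from the model under test; the example pairs and error labels are illustrative, not taken from the paper.

```python
# Minimal sketch of an NND-style evaluation. Each test pairs a
# high-quality candidate with a near-negative candidate whose error
# type is known from prior human annotation. The numbers below are
# made-up placeholders for model log-likelihoods.

from collections import Counter

nnd_tests = [
    # (log p(high-quality | input), log p(near-negative | input), error type)
    (-1.2, -1.9, "disfluent"),
    (-0.8, -0.7, "not answerable"),
    (-2.1, -2.5, "off target"),
    (-1.5, -1.4, "disfluent"),
]

passed = 0
failures_by_error = Counter()

for ll_positive, ll_negative, error_type in nnd_tests:
    # The model passes a test when it assigns higher likelihood
    # to the high-quality candidate than to the near-negative one.
    if ll_positive > ll_negative:
        passed += 1
    else:
        failures_by_error[error_type] += 1

print(f"NND pass rate: {passed / len(nnd_tests):.2%}")
print("Failures by error type:", dict(failures_by_error))
```

In practice, the two log-likelihoods would come from scoring each candidate with the NLG model under test, for instance by summing token log-probabilities and normalizing by candidate length; the failure distribution then gives the task-specific error analysis the abstract describes.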
Related papers
- Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability [39.12792986841385]
In this paper, we construct a large-scale NLG evaluation corpus NLG-Eval with annotations from both human and GPT-4.
We also propose Themis, an LLM dedicated to NLG evaluation, trained with our designed multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
arXiv Detail & Related papers (2024-06-26T14:04:29Z)
- GNNEvaluator: Evaluating GNN Performance On Unseen Graphs Without Labels [81.93520935479984]
We study a new problem, GNN model evaluation, which aims to assess the performance of a specific GNN model, trained on labeled and observed graphs, on unseen graphs without labels.
We propose a two-stage GNN model evaluation framework, including (1) DiscGraph set construction and (2) GNNEvaluator training and inference.
Under the effective training supervision from the DiscGraph set, GNNEvaluator learns to precisely estimate node classification accuracy of the to-be-evaluated GNN model.
arXiv Detail & Related papers (2023-10-23T05:51:59Z)
- No Strong Feelings One Way or Another: Re-operationalizing Neutrality in Natural Language Inference [6.485890157501745]
Natural Language Inference (NLI) has been a cornerstone task in evaluating language models' inferential reasoning capabilities.
The standard three-way classification scheme used in NLI has well-known shortcomings in evaluating models' ability to capture the nuances of natural human reasoning.
We argue that the operationalization of the neutral label in current NLI datasets has low validity, is interpreted inconsistently, and that at least one important sense of neutrality is often ignored.
arXiv Detail & Related papers (2023-06-16T15:45:08Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework that uses large language models with chain-of-thought (CoT) prompting and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin (see the correlation sketch after this list).
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
- Equitable Ability Estimation in Neurodivergent Student Populations with Zero-Inflated Learner Models [3.418206750929592]
This paper attempts to model the relationships between context (delivery and response types) and the performance of neurodivergent (ND) students with zero-inflated learner models.
This approach facilitates simulation of several expected ND behavioural traits, provides equitable ability estimates across all student groups from generated datasets, increases interpretability confidence, and can significantly increase the quality of learning opportunities for ND students.
arXiv Detail & Related papers (2022-03-18T21:47:01Z)
- Dual Inference for Improving Language Understanding and Generation [35.251935231914366]
Natural language understanding (NLU) and natural language generation (NLG) hold a strong dual relationship: NLU predicts semantic labels from natural language utterances, while NLG does the reverse.
This paper proposes to leverage this duality at the inference stage without the need for retraining.
arXiv Detail & Related papers (2020-10-08T20:14:41Z)
- What Can We Learn from Collective Human Opinions on Natural Language Inference Data? [88.90490998032429]
ChaosNLI is a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS.
This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive-NLI.
arXiv Detail & Related papers (2020-10-07T17:26:06Z)
- GraN: An Efficient Gradient-Norm Based Detector for Adversarial and Misclassified Examples [77.99182201815763]
Deep neural networks (DNNs) are vulnerable to adversarial examples and other data perturbations.
GraN is a time- and parameter-efficient method that is easily adaptable to any DNN.
GraN achieves state-of-the-art performance on numerous problem set-ups (see the gradient-norm sketch after this list).
arXiv Detail & Related papers (2020-04-20T10:09:27Z)
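Following up on the GraN entry above, here is a minimal sketch of using the parameter-gradient norm as a detection score for a toy PyTorch classifier. The model, the choice of the predicted class as the loss target, and all values are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a gradient-norm detection score in the spirit of GraN:
# inputs whose loss gradient has an unusually large norm are flagged as
# likely adversarial or misclassified. Toy model for illustration only.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()

def gradient_norm_score(x: torch.Tensor) -> float:
    """Norm of the parameter gradient for one input, using the model's
    own prediction as the target (no true label needed at test time)."""
    model.zero_grad()
    logits = model(x.unsqueeze(0))
    predicted = logits.argmax(dim=-1)
    loss = loss_fn(logits, predicted)
    loss.backward()
    grads = [p.grad.flatten() for p in model.parameters() if p.grad is not None]
    return torch.cat(grads).norm().item()

score = gradient_norm_score(torch.randn(8))
print(f"gradient-norm score: {score:.4f}")
```

A real detector would calibrate a threshold on held-out clean data and flag inputs whose score exceeds it.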
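Both the NND abstract and the G-Eval entry report agreement with human judgments via rank correlation. Here is a minimal sketch of that meta-evaluation step, assuming made-up human ratings and metric scores:

```python
# Minimal sketch of the meta-evaluation behind claims like G-Eval's
# "Spearman correlation of 0.514 with human judgments": rank-correlate
# a metric's scores with human ratings over the same set of outputs.
# All scores below are illustrative placeholders.

from scipy.stats import spearmanr

human_ratings = [4.5, 2.0, 3.5, 1.0, 5.0, 3.0]       # e.g., Likert judgments
metric_scores = [0.81, 0.35, 0.60, 0.42, 0.90, 0.55]  # e.g., automatic metric

rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

Spearman correlation compares rankings rather than raw values, so a metric is rewarded for ordering outputs the way humans do even if its scores live on a different scale.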
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.