What Can We Learn from Collective Human Opinions on Natural Language Inference Data?
- URL: http://arxiv.org/abs/2010.03532v2
- Date: Thu, 8 Oct 2020 19:32:45 GMT
- Title: What Can We Learn from Collective Human Opinions on Natural Language Inference Data?
- Authors: Yixin Nie, Xiang Zhou, Mohit Bansal
- Abstract summary: ChaosNLI is a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS.
This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive-NLI.
- Score: 88.90490998032429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the subjective nature of many NLP tasks, most NLU evaluations have
focused on using the majority label with presumably high agreement as the
ground truth. Less attention has been paid to the distribution of human
opinions. We collect ChaosNLI, a dataset with a total of 464,500 annotations to
study Collective HumAn OpinionS in oft-used NLI evaluation sets. This dataset
is created by collecting 100 annotations per example for 3,113 examples in SNLI
and MNLI and 1,532 examples in Abductive-NLI. Analysis reveals that: (1) high
human disagreement exists in a noticeable amount of examples in these datasets;
(2) the state-of-the-art models lack the ability to recover the distribution
over human labels; (3) models achieve near-perfect accuracy on the subset of
data with a high level of human agreement, whereas they can barely beat a
random guess on the data with low levels of human agreement, which account for
most of the errors that state-of-the-art models make on these evaluation sets.
This calls into question the validity of improving model performance on old
metrics for the low-agreement part of evaluation datasets. Hence, we argue for a detailed
examination of human agreement in future data collection efforts, and
evaluating model outputs against the distribution over collective human
opinions. The ChaosNLI dataset and experimental scripts are available at
https://github.com/easonnie/ChaosNLI
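Evaluating against collective opinions means scoring a model's predicted label distribution against the empirical distribution of annotator votes, rather than against a single majority label. Below is a minimal sketch of that comparison using Jensen-Shannon and KL divergence; the vote counts, softmax values, and function names are illustrative assumptions, not code from the ChaosNLI repository.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, clipped for numerical safety."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded measure of distribution mismatch."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical ChaosNLI-style example: 100 annotator votes over
# (entailment, neutral, contradiction) and a model's softmax output.
human_votes = np.array([64, 28, 8])
human_dist = human_votes / human_votes.sum()
model_dist = np.array([0.91, 0.07, 0.02])

print(f"JSD(model, human) = {js_divergence(model_dist, human_dist):.4f}")
print(f"KL(model || human) = {kl_divergence(model_dist, human_dist):.4f}")
```

A model that simply predicts the majority label with high confidence can still score poorly on such metrics whenever annotators genuinely disagree, which is exactly the behavior the abstract highlights.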
Related papers
- AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference [22.13596750775719]
We introduce AlignSum, a novel framework for aligning with human summarization preferences.
With AlignSum, PLMs like BART-Large surpass 175B GPT-3 in both automatic and human evaluations.
arXiv Detail & Related papers (2024-10-01T05:14:48Z)
- Does Data-Efficient Generalization Exacerbate Bias in Foundation Models? [2.298227866545911]
Foundation models have emerged as robust models with label efficiency in diverse domains.
It is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model.
This research examines bias in a foundation model when it is fine-tuned on the Brazilian Multilabel Ophthalmological dataset.
arXiv Detail & Related papers (2024-08-28T22:14:44Z)
- Designing NLP Systems That Adapt to Diverse Worldviews [4.915541242112533]
We argue that existing NLP datasets often obscure this diversity of perspectives by aggregating labels or filtering out disagreement.
We propose a perspectivist approach: building datasets that capture annotator demographics, values, and justifications for their labels.
arXiv Detail & Related papers (2024-05-18T06:48:09Z)
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Automatically Identifying Semantic Bias in Crowdsourced Natural Language Inference Datasets [78.6856732729301]
We introduce a model-driven, unsupervised technique to find "bias clusters" in a learned embedding space of hypotheses in NLI datasets.
Interventions and additional rounds of labeling can then be performed to ameliorate the semantic bias of the hypothesis distribution of a dataset.
arXiv Detail & Related papers (2021-12-16T22:49:01Z)
- COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences [21.11065466376105]
Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI).
Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets.
We introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements.
arXiv Detail & Related papers (2021-06-02T06:31:55Z)
- NLI Data Sanity Check: Assessing the Effect of Data Corruption on Model Performance [3.7024660695776066]
We propose a new diagnostic test suite that makes it possible to assess whether a dataset constitutes a good testbed for evaluating models' meaning-understanding capabilities.
We specifically apply controlled corruption transformations to widely used benchmarks (MNLI and ANLI); a toy sketch of one such transformation follows after this list.
A large decrease in model accuracy indicates that the original dataset provides a proper challenge to the models' reasoning capabilities.
arXiv Detail & Related papers (2021-04-10T12:28:07Z)
- Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof.
At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
arXiv Detail & Related papers (2020-04-21T16:54:57Z)
- Evaluating Models' Local Decision Boundaries via Contrast Sets [119.38387782979474]
We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data.
We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets.
Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets.
arXiv Detail & Related papers (2020-04-06T14:47:18Z)
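To make the corruption idea referenced in the NLI Data Sanity Check entry concrete, here is a toy sketch of one possible transformation; the field names and the word-shuffling corruption are assumptions chosen for illustration, not the transformations used in that paper.

```python
import random

def shuffle_hypothesis_words(example, seed=0):
    """Toy corruption: permute hypothesis tokens so word-level cues survive
    while sentence meaning is destroyed."""
    rng = random.Random(seed)
    tokens = example["hypothesis"].split()
    rng.shuffle(tokens)
    return {**example, "hypothesis": " ".join(tokens)}

# Hypothetical MNLI-style example.
example = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A musician is performing for an audience.",
    "label": "entailment",
}
corrupted = shuffle_hypothesis_words(example)
print(corrupted["hypothesis"])
# If a model's accuracy barely drops on data corrupted this way, the benchmark is
# likely solvable from shallow cues rather than genuine meaning understanding.
```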