Does Putting a Linguist in the Loop Improve NLU Data Collection?
- URL: http://arxiv.org/abs/2104.07179v1
- Date: Thu, 15 Apr 2021 00:31:10 GMT
- Title: Does Putting a Linguist in the Loop Improve NLU Data Collection?
- Authors: Alicia Parrish, William Huang, Omar Agha, Soo-Hwan Lee, Nikita Nangia,
Alex Warstadt, Karmanya Aggarwal, Emily Allaway, Tal Linzen and Samuel R.
Bowman
- Abstract summary: Many crowdsourced NLP datasets contain systematic gaps and biases that are identified only after data collection is complete.
We take natural language inference as a test case and ask whether it is beneficial to put a linguist 'in the loop' during data collection.
- Score: 34.34874979524489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many crowdsourced NLP datasets contain systematic gaps and biases that are
identified only after data collection is complete. Identifying these issues
from early data samples during crowdsourcing should make mitigation more
efficient, especially when done iteratively. We take natural language inference
as a test case and ask whether it is beneficial to put a linguist 'in the loop'
during data collection to dynamically identify and address gaps in the data by
introducing novel constraints on the task. We directly compare three data
collection protocols: (i) a baseline protocol, (ii) a linguist-in-the-loop
intervention with iteratively-updated constraints on the task, and (iii) an
extension of linguist-in-the-loop that provides direct interaction between
linguists and crowdworkers via a chatroom. The datasets collected with linguist
involvement are more reliably challenging than baseline, without loss of
quality. But we see no evidence that using this data in training leads to
better out-of-domain model performance, and the addition of a chat platform has
no measurable effect on the resulting dataset. We suggest integrating expert
analysis *during* data collection so that the expert can dynamically
address gaps and biases in the dataset.
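To make the iterative protocol (ii) concrete, here is a minimal Python sketch of one way the loop could be organized. The helpers collect_batch and linguist_review, and the round and batch sizes, are illustrative assumptions rather than the paper's actual tooling.

```python
# Minimal sketch of the iterative linguist-in-the-loop cycle (protocol ii).
# collect_batch() and linguist_review() are hypothetical stand-ins for the
# crowdsourcing task and the expert analysis; they are not the paper's tooling.

from typing import Dict, List


def collect_batch(constraints: List[str], batch_size: int) -> List[Dict[str, str]]:
    """Stand-in: post the NLI writing task, showing the current constraints in
    the instructions, and return premise/hypothesis/label dicts."""
    return []  # placeholder


def linguist_review(sample: List[Dict[str, str]]) -> List[str]:
    """Stand-in: a linguist inspects recent examples and writes new constraints
    targeting the gaps or annotation artifacts they observe."""
    return []  # placeholder


def run_protocol(num_rounds: int = 5, batch_size: int = 1000) -> List[Dict[str, str]]:
    constraints: List[str] = []   # e.g. "avoid contradictions formed only by negation"
    dataset: List[Dict[str, str]] = []
    for _ in range(num_rounds):
        # Crowdworkers write examples under the constraints accumulated so far.
        batch = collect_batch(constraints, batch_size)
        dataset.extend(batch)
        # Between rounds, the expert updates the constraints for the next batch.
        constraints += linguist_review(batch[:100])
    return dataset
```

The point of this structure is that constraints accumulate across rounds, so each new batch is written against every gap the linguist has identified so far; protocol (iii) adds direct linguist-crowdworker interaction via a chatroom alongside the same loop.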
Related papers
- DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS).
The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Multi-Scales Data Augmentation Approach In Natural Language Inference For Artifacts Mitigation And Pre-Trained Model Optimization [0.0]
We provide a variety of techniques for analyzing and locating dataset artifacts inside the crowdsourced Stanford Natural Language Inference corpus.
To mitigate dataset artifacts, we employ a unique multi-scale data augmentation technique with two distinct frameworks.
Our combination method enhances our model's resistance to perturbation testing, enabling it to consistently outperform the pre-trained baseline.
arXiv Detail & Related papers (2022-12-16T23:37:44Z)
- Automatically Identifying Semantic Bias in Crowdsourced Natural Language Inference Datasets [78.6856732729301]
We introduce a model-driven, unsupervised technique to find "bias clusters" in a learned embedding space of hypotheses in NLI datasets.
Interventions and additional rounds of labeling can then be performed to ameliorate the semantic bias of a dataset's hypothesis distribution (a rough sketch of the clustering step appears after this list).
arXiv Detail & Related papers (2021-12-16T22:49:01Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks? [31.39009622826369]
We compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality.
We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty.
We observe that the data from the iterative protocol with expert assessments is more challenging by several measures.
arXiv Detail & Related papers (2021-06-01T21:05:52Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- On the Effectiveness of Dataset Embeddings in Mono-lingual, Multi-lingual and Zero-shot Conditions [18.755176247223616]
We compare the effect of dataset embeddings in mono-lingual settings, multi-lingual settings, and with predicted data source labels in a zero-shot setting.
We evaluate on three morphosyntactic tasks: morphological tagging, lemmatization, and dependency parsing, and use 104 datasets, 66 languages, and two different dataset grouping strategies.
arXiv Detail & Related papers (2021-03-01T19:34:32Z)
- Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation [14.92157586545743]
This paper presents a number of techniques for making models more robust in the domain of causal reasoning.
We show a statistically significant improvement in performance on both datasets, even with only a small number of additionally generated data points.
arXiv Detail & Related papers (2021-01-13T09:55:29Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments on two public datasets and obtain significant improvements on both.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
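As a rough illustration of the bias-cluster idea from the "Automatically Identifying Semantic Bias" entry above, the sketch below clusters hypothesis embeddings and flags clusters with a heavily skewed label distribution. The TF-IDF stand-in encoder, the k-means step, and the 0.8 skew threshold are assumptions made here for a runnable example; the paper operates in a learned embedding space and its exact flagging criterion may differ.

```python
# Rough sketch of the bias-cluster idea: embed hypotheses, cluster them, and
# flag clusters whose label distribution is heavily skewed. The TF-IDF encoder,
# k-means, and the skew threshold are illustrative assumptions only.

from collections import Counter
from typing import List, Tuple

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def embed(hypotheses: List[str]) -> np.ndarray:
    """Stand-in encoder; swap in any learned sentence encoder here."""
    return TfidfVectorizer().fit_transform(hypotheses).toarray()


def find_bias_clusters(hypotheses: List[str], labels: List[str],
                       n_clusters: int = 50,
                       skew_threshold: float = 0.8) -> List[Tuple[int, str, float]]:
    """Return (cluster_id, majority_label, label_share) for skewed clusters."""
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embed(hypotheses))
    flagged = []
    for cid in range(n_clusters):
        cluster_labels = [lab for lab, c in zip(labels, cluster_ids) if c == cid]
        if not cluster_labels:
            continue
        majority_label, count = Counter(cluster_labels).most_common(1)[0]
        share = count / len(cluster_labels)
        if share >= skew_threshold:
            # Hypotheses in this cluster almost always carry one label: a
            # candidate semantic bias worth targeted re-labeling or intervention.
            flagged.append((cid, majority_label, share))
    return flagged
```

Flagged clusters would then feed the kind of targeted interventions and additional labeling rounds the entry describes.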