What Ingredients Make for an Effective Crowdsourcing Protocol for
Difficult NLU Data Collection Tasks?
- URL: http://arxiv.org/abs/2106.00794v1
- Date: Tue, 1 Jun 2021 21:05:52 GMT
- Title: What Ingredients Make for an Effective Crowdsourcing Protocol for
Difficult NLU Data Collection Tasks?
- Authors: Nikita Nangia, Saku Sugawara, Harsh Trivedi, Alex Warstadt, Clara
Vania, Samuel R. Bowman
- Abstract summary: We compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality.
We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty.
We observe that the data from the iterative protocol with expert assessments is more challenging by several measures.
- Score: 31.39009622826369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Crowdsourcing is widely used to create data for common natural language
understanding tasks. Despite the importance of these datasets for measuring and
refining model understanding of language, there has been little focus on the
crowdsourcing methods used for collecting the datasets. In this paper, we
compare the efficacy of interventions that have been proposed in prior work as
ways of improving data quality. We use multiple-choice question answering as a
testbed and run a randomized trial by assigning crowdworkers to write questions
under one of four different data collection protocols. We find that asking
workers to write explanations for their examples is an ineffective stand-alone
strategy for boosting NLU example difficulty. However, we find that training
crowdworkers, and then using an iterative process of collecting data, sending
feedback, and qualifying workers based on expert judgments is an effective
means of collecting challenging data. But using crowdsourced judgments, rather
than expert judgments, to qualify workers and send feedback does not prove to be effective.
We observe that the data from the iterative protocol with expert assessments is
more challenging by several measures. Notably, the human–model gap on the
unanimous agreement portion of this data is, on average, twice as large as the
gap for the baseline protocol data.
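As a rough illustration of how that headline comparison could be computed, below is a minimal Python sketch for estimating the human–model accuracy gap on the unanimously agreed-upon portion of a multiple-choice QA evaluation set, broken down by collection protocol. This is not the authors' released code; the file name and the "human_votes", "model_prediction", "gold_label", and "protocol" fields are hypothetical placeholders.

```python
import json
from statistics import mean

# Minimal sketch (assumed data layout, not the paper's code): estimate the
# human-model accuracy gap on the subset of examples where all human
# annotators chose the same answer.

def accuracy(predictions, gold):
    # Fraction of predictions that match the gold label.
    return mean(p == g for p, g in zip(predictions, gold))

def human_model_gap(examples):
    # Keep only examples with unanimous human agreement.
    unanimous = [ex for ex in examples if len(set(ex["human_votes"])) == 1]
    human_preds = [ex["human_votes"][0] for ex in unanimous]
    model_preds = [ex["model_prediction"] for ex in unanimous]
    gold = [ex["gold_label"] for ex in unanimous]
    return accuracy(human_preds, gold) - accuracy(model_preds, gold)

if __name__ == "__main__":
    # Hypothetical JSONL file with one annotated example per line.
    with open("validation_annotations.jsonl") as f:
        examples = [json.loads(line) for line in f]
    # Compare the gap across protocols, e.g. baseline vs. the iterative
    # expert-feedback protocol described in the abstract.
    for protocol in sorted({ex["protocol"] for ex in examples}):
        subset = [ex for ex in examples if ex["protocol"] == protocol]
        print(protocol, round(human_model_gap(subset), 3))
```

Under this setup, the abstract's finding corresponds to the gap printed for the expert-feedback protocol being roughly twice that of the baseline protocol.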
Related papers
- A Dataset for the Validation of Truth Inference Algorithms Suitable for Online Deployment [76.04306818209753]
We introduce a substantial crowdsourcing annotation dataset collected from a real-world crowdsourcing platform.
This dataset comprises approximately two thousand workers, one million tasks, and six million annotations.
We evaluate the effectiveness of several representative truth inference algorithms on this dataset.
arXiv Detail & Related papers (2024-03-10T16:00:41Z)
- From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning [38.30983556062276]
A major challenge in Natural Language Processing is obtaining annotated data for supervised learning.
Crowdsourcing introduces issues related to the annotator's experience, consistency, and biases.
This paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning.
arXiv Detail & Related papers (2024-01-24T04:57:32Z)
- Behavior Retrieval: Few-Shot Imitation Learning by Querying Unlabeled Datasets [73.2096288987301]
We propose a simple approach that uses a small amount of downstream expert data to selectively query relevant behaviors from an offline, unlabeled dataset.
We observe that our method learns to query only the relevant transitions to the task, filtering out sub-optimal or task-irrelevant data.
Our simple querying approach outperforms more complex goal-conditioned methods by 20% across simulated and real robotic manipulation tasks from images.
arXiv Detail & Related papers (2023-04-18T05:42:53Z)
- Bag of Tricks for Training Data Extraction from Language Models [98.40637430115204]
We investigate and benchmark tricks for improving training data extraction using a publicly available dataset.
The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction.
arXiv Detail & Related papers (2023-02-09T06:46:42Z)
- Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech [10.323063834827416]
Undermining the impact of hateful content with informed and non-aggressive responses, called counter narratives, has emerged as a possible strategy for fostering healthier online communities.
We propose a novel human-in-the-loop data collection methodology in which a generative language model is refined iteratively.
Results show that the methodology is scalable and facilitates diverse, novel, and cost-effective data collection.
arXiv Detail & Related papers (2021-07-19T09:45:54Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Does Putting a Linguist in the Loop Improve NLU Data Collection? [34.34874979524489]
Crowdsourced NLP datasets contain systematic gaps and biases that are identified only after data collection is complete.
We take natural language inference as a test case and ask whether it is beneficial to put a linguist in the loop during data collection.
arXiv Detail & Related papers (2021-04-15T00:31:10Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)