On the Efficacy of Adversarial Data Collection for Question Answering:
Results from a Large-Scale Randomized Study
- URL: http://arxiv.org/abs/2106.00872v1
- Date: Wed, 2 Jun 2021 00:48:33 GMT
- Title: On the Efficacy of Adversarial Data Collection for Question Answering:
Results from a Large-Scale Randomized Study
- Authors: Divyansh Kaushik, Douwe Kiela, Zachary C. Lipton, Wen-tau Yih
- Abstract summary: In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
- Score: 65.17429512679695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In adversarial data collection (ADC), a human workforce interacts with a
model in real time, attempting to produce examples that elicit incorrect
predictions. Researchers hope that models trained on these more challenging
datasets will rely less on superficial patterns, and thus be less brittle.
However, despite ADC's intuitive appeal, it remains unclear when training on
adversarial datasets produces more robust models. In this paper, we conduct a
large-scale controlled study focused on question answering, assigning workers
at random to compose questions either (i) adversarially (with a model in the
loop); or (ii) in the standard fashion (without a model). Across a variety of
models and datasets, we find that models trained on adversarial data usually
perform better on other adversarial datasets but worse on a diverse collection
of out-of-domain evaluation sets. Finally, we provide a qualitative analysis of
adversarial (vs standard) data, identifying key differences and offering
guidance for future research.
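The comparison at the heart of the study can be pictured as a small evaluation harness: train one model on the adversarially collected questions, train another on the standard ones, and score both on adversarial and out-of-domain evaluation sets. The sketch below is illustrative only, not the authors' released code; `train_qa_model` and `evaluate_f1` are hypothetical stand-ins for any QA fine-tuning and scoring routine.

```python
# Minimal sketch (assumed interfaces, not the paper's code) of comparing models
# trained on adversarially collected vs. standard-collected QA data.
from typing import Callable, Dict, List, Tuple

QAExample = Dict[str, str]  # e.g. {"context": ..., "question": ..., "answer": ...}
TrainFn = Callable[[List[QAExample]], object]           # hypothetical training routine
EvalFn = Callable[[object, List[QAExample]], float]     # hypothetical F1/EM scorer


def compare_collection_protocols(
    adversarial_train: List[QAExample],
    standard_train: List[QAExample],
    eval_sets: Dict[str, List[QAExample]],
    train_qa_model: TrainFn,
    evaluate_f1: EvalFn,
) -> Dict[str, Tuple[float, float]]:
    """Return {eval_set_name: (score of adv-trained model, score of std-trained model)}."""
    model_adv = train_qa_model(adversarial_train)  # data written with a model in the loop
    model_std = train_qa_model(standard_train)     # data written without a model
    return {
        name: (evaluate_f1(model_adv, examples), evaluate_f1(model_std, examples))
        for name, examples in eval_sets.items()
    }
```

Under this setup, the paper's headline finding corresponds to the adversarially trained model scoring higher on the adversarial evaluation sets but lower on the diverse out-of-domain ones.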
Related papers
- A Study on Domain Generalization for Failure Detection through Human
Reactions in HRI [7.664159325276515]
Machine learning models are commonly tested in-distribution (same dataset); performance almost always drops in out-of-distribution settings.
This makes domain generalization - retaining performance in different settings - a critical issue.
We present a concise analysis of domain generalization in failure detection models trained on human facial expressions.
arXiv Detail & Related papers (2024-03-10T21:30:22Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models [3.9052860539161918]
We propose a simple method for measuring the scale of a model's reliance on any identified spurious feature.
We assess robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA).
We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods cannot be explained by mitigated reliance on biased features.
arXiv Detail & Related papers (2023-05-11T14:35:00Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, one is given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning [54.61762276179205]
We propose a novel contrastive learning approach, MMBS, for building robust VQA models by Making the Most of Biased Samples.
Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlation from the original training samples.
We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2.
arXiv Detail & Related papers (2022-10-10T11:05:21Z)
- Sharing pattern submodels for prediction with missing values [12.981974894538668]
Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time.
We propose an alternative approach, called sharing pattern submodels, which i) makes predictions robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability.
arXiv Detail & Related papers (2022-06-22T15:09:40Z)
- Zero-shot meta-learning for small-scale data from human subjects [10.320654885121346]
We develop a framework to rapidly adapt to a new prediction task with limited training data for out-of-sample test data.
Our model learns the latent treatment effects of each intervention and, by design, can naturally handle multi-task predictions.
Our model has implications for improved generalization of small-size human studies to the wider population.
arXiv Detail & Related papers (2022-03-29T17:42:04Z)
- Analyzing Dynamic Adversarial Training Data in the Limit [50.00850852546616]
Dynamic adversarial data collection (DADC) holds promise as an approach for generating diverse training sets.
We present the first study of longer-term DADC, where we collect 20 rounds of NLI examples for a small set of premise paragraphs.
Models trained on DADC examples make 26% fewer errors on our expert-curated test set compared to models trained on non-adversarial data.
arXiv Detail & Related papers (2021-10-16T08:48:52Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation [41.9785159975426]
State-of-the-art question answering models remain susceptible to a variety of adversarial attacks and are still far from obtaining human-level language understanding.
One proposed way forward is dynamic adversarial data collection, in which a human annotator attempts to create examples for which a model-in-the-loop fails.
In this work, we investigate several answer selection, question generation, and filtering methods that form a synthetic adversarial data generation pipeline (an illustrative sketch of such a pipeline appears after this list).
Models trained on both synthetic and human-generated data outperform models not trained on synthetic adversarial data, and obtain state-of-the-art results on the AdversarialQA benchmark.
arXiv Detail & Related papers (2021-04-18T02:00:06Z)
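As a rough illustration of the kind of pipeline described in the last entry above, the sketch below selects candidate answer spans from a passage, generates a question for each, and keeps only the pairs that a QA model-in-the-loop answers incorrectly. It is a minimal sketch under assumed interfaces: `select_answer_spans`, `generate_question`, and `qa_model_predict` are hypothetical callables, not any specific paper's released components.

```python
# Hypothetical sketch of a synthetic adversarial data generation pipeline:
# answer selection -> question generation -> filtering by a model in the loop.
from typing import Callable, Dict, List

Example = Dict[str, str]  # keys: "context", "question", "answer"


def synthesize_adversarial_examples(
    passages: List[str],
    select_answer_spans: Callable[[str], List[str]],
    generate_question: Callable[[str, str], str],
    qa_model_predict: Callable[[str, str], str],
) -> List[Example]:
    """Keep only generated question-answer pairs that the model-in-the-loop gets wrong."""
    kept: List[Example] = []
    for context in passages:
        for answer in select_answer_spans(context):           # answer selection
            question = generate_question(context, answer)     # question generation
            prediction = qa_model_predict(context, question)  # model in the loop
            if prediction.strip().lower() != answer.strip().lower():  # filtering step
                kept.append({"context": context, "question": question, "answer": answer})
    return kept
```

In practice each callable would wrap a trained model (for example an answer-span tagger, a question generator, and the QA model being attacked), and the filtered pairs would be added to the training data.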
This list is automatically generated from the titles and abstracts of the papers on this site.