Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension
- URL: http://arxiv.org/abs/2002.00293v2
- Date: Tue, 22 Sep 2020 16:02:10 GMT
- Title: Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension
- Authors: Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, Pontus Stenetorp
- Abstract summary: Humans create questions adversarially, such that the model fails to answer them correctly.
We collect 36,000 samples with progressively stronger models in the annotation loop.
We find that training on adversarially collected samples leads to strong generalisation to non-adversarially collected datasets.
We find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop.
- Score: 27.538957000237176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Innovations in annotation methodology have been a catalyst for Reading
Comprehension (RC) datasets and models. One recent trend to challenge current
RC models is to involve a model in the annotation process: humans create
questions adversarially, such that the model fails to answer them correctly. In
this work we investigate this annotation methodology and apply it in three
different settings, collecting a total of 36,000 samples with progressively
stronger models in the annotation loop. This allows us to explore questions
such as the reproducibility of the adversarial effect, transfer from data
collected with varying model-in-the-loop strengths, and generalisation to data
collected without a model. We find that training on adversarially collected
samples leads to strong generalisation to non-adversarially collected datasets,
yet with progressive performance deterioration with increasingly stronger
models-in-the-loop. Furthermore, we find that stronger models can still learn
from datasets collected with substantially weaker models-in-the-loop. When
trained on data collected with a BiDAF model in the loop, RoBERTa achieves
39.9 F1 on questions that it cannot answer when trained on SQuAD - only
marginally lower than when trained on data collected using RoBERTa itself
(41.0 F1).
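To make the model-in-the-loop annotation step concrete, below is a minimal sketch, assuming a Hugging Face extractive-QA pipeline as the model and SQuAD-style word-overlap F1 as the success criterion; the checkpoint name and the 0.4 threshold are illustrative stand-ins rather than the paper's exact configuration.

```python
from collections import Counter

from transformers import pipeline

# Any extractive QA model can play the model-in-the-loop; this checkpoint
# is an illustrative choice, not the paper's BiDAF/BERT/RoBERTa setup.
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

def word_overlap_f1(prediction: str, reference: str) -> float:
    """SQuAD-style word-overlap F1 between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def beats_the_model(context: str, question: str, human_answer: str,
                    threshold: float = 0.4) -> bool:
    """Accept a human-written sample only when the model's prediction
    overlaps poorly with the human answer, i.e. the annotator beat the AI."""
    prediction = qa_model(question=question, context=context)["answer"]
    return word_overlap_f1(prediction, human_answer) < threshold
```

In the paper's setting annotators retry until they succeed, and accepted samples are then human-validated; the sketch only shows the accept/reject decision for a single attempt.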
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- REFINE on Scarce Data: Retrieval Enhancement through Fine-Tuning via Model Fusion of Embedding Models [14.023953508288628]
Retrieval augmented generation (RAG) pipelines are commonly used in tasks such as question-answering (QA).
We propose REFINE, a novel technique that generates synthetic data from available documents and then uses a model fusion approach to fine-tune embeddings.
arXiv Detail & Related papers (2024-10-16T08:43:39Z)
- Weak Reward Model Transforms Generative Models into Robust Causal Event Extraction Systems [17.10762463903638]
We train evaluation models to approximate human evaluation, achieving high agreement.
We propose a weak-to-strong supervision method that uses a fraction of the annotated data to train an evaluation model.
arXiv Detail & Related papers (2024-06-26T10:48:14Z)
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data (see the first sketch after this list).
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that a model learned this way can achieve performance comparable to that of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation [41.9785159975426]
State-of-the-art question answering models remain susceptible to a variety of adversarial attacks and are still far from obtaining human-level language understanding.
One proposed way forward is dynamic adversarial data collection, in which a human annotator attempts to create examples for which a model-in-the-loop fails.
In this work, we investigate several answer selection, question generation, and filtering methods that form a synthetic adversarial data generation pipeline.
Models trained on both synthetic and human-generated data outperform models not trained on synthetic adversarial data, and obtain state-of-the-art results on the AdversarialQA benchmark (see the second sketch after this list).
arXiv Detail & Related papers (2021-04-18T02:00:06Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
- Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data [49.378860065474875]
We identify failure modes of SOTA relation extraction (RE) models trained on TACRED.
Adding some of the challenge data as training examples improves the model's performance.
arXiv Detail & Related papers (2020-10-07T21:17:25Z)
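As referenced above, here is a minimal sketch of the ReST$^{EM}$-style self-training loop from "Beyond Human Data": sample candidate solutions, keep those that pass a scalar-feedback check, and fine-tune on the kept set. `generate`, `reward`, and `fine_tune` are hypothetical stand-ins, not that paper's implementation.

```python
from typing import Callable, List, Tuple

def rest_em_loop(
    problems: List[str],
    generate: Callable[[str, int], List[str]],  # sample k solutions per problem
    reward: Callable[[str, str], float],        # scalar feedback, e.g. 1.0 if correct
    fine_tune: Callable[[List[Tuple[str, str]]], None],
    iterations: int = 3,
    samples_per_problem: int = 8,
) -> None:
    for _ in range(iterations):
        kept: List[Tuple[str, str]] = []
        for problem in problems:
            # E-step: sample candidate solutions from the current model.
            for solution in generate(problem, samples_per_problem):
                # Keep only solutions that pass the (binarised) reward check.
                if reward(problem, solution) > 0.5:
                    kept.append((problem, solution))
        # M-step: fine-tune on the filtered, self-generated data; the paper
        # restarts from the base checkpoint each iteration to avoid drift.
        fine_tune(kept)
```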
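And a hedged sketch of the generate-then-filter step behind the synthetic adversarial data pipeline summarised under "Improving Question Answering Model Robustness": the question-generation checkpoint, the <hl> highlight format, and the exact-match filter are all assumptions standing in for that paper's answer selection, question generation, and filtering components.

```python
from typing import Optional

from transformers import pipeline

# Assumed checkpoints: a T5-based question generator plus an extractive QA
# model as the adversarial filter; neither is that paper's exact setup.
question_generator = pipeline("text2text-generation",
                              model="valhalla/t5-base-qg-hl")
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

def synthesize_adversarial(context: str, answer: str) -> Optional[dict]:
    """Generate a question for a chosen answer span and keep the pair
    only if the QA model-in-the-loop fails to recover the answer."""
    # Highlight the target span so the generator knows what to ask about.
    highlighted = context.replace(answer, f"<hl> {answer} <hl>", 1)
    question = question_generator(
        "generate question: " + highlighted)[0]["generated_text"]
    prediction = qa_model(question=question, context=context)["answer"]
    # Exact match is a crude stand-in for the paper's filtering criteria.
    if prediction.strip().lower() == answer.strip().lower():
        return None
    return {"context": context, "question": question, "answer": answer}
```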