Competency Problems: On Finding and Removing Artifacts in Language Data
- URL: http://arxiv.org/abs/2104.08646v1
- Date: Sat, 17 Apr 2021 21:34:10 GMT
- Title: Competency Problems: On Finding and Removing Artifacts in Language Data
- Authors: Matt Gardner, William Merrill, Jesse Dodge, Matthew E. Peters, Alexis Ross, Sameer Singh, Noah Smith
- Abstract summary: We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
- Score: 50.09608320112584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Much recent work in NLP has documented dataset artifacts, bias, and spurious
correlations between input features and output labels. However, how to tell
which features have "spurious" instead of legitimate correlations is typically
left unspecified. In this work we argue that for complex language understanding
tasks, all simple feature correlations are spurious, and we formalize this
notion into a class of problems which we call competency problems. For example,
the word "amazing" on its own should not give information about a sentiment
label independent of the context in which it appears, which could include
negation, metaphor, sarcasm, etc. We theoretically analyze the difficulty of
creating data for competency problems when human bias is taken into account,
showing that realistic datasets will increasingly deviate from competency
problems as dataset size increases. This analysis gives us a simple statistical
test for dataset artifacts, which we use to show more subtle biases than were
described in prior work, including demonstrating that models are
inappropriately affected by these less extreme biases. Our theoretical
treatment of this problem also allows us to analyze proposed solutions, such as
making local edits to dataset instances, and to give recommendations for future
data collection and model design efforts that target competency problems.
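The abstract does not spell out the statistical test, but a minimal sketch of a per-feature check in this spirit might look like the following: assuming a roughly label-balanced binary task, the null hypothesis is that a token carries no information about the label (p = 0.5), and tokens whose label counts deviate too far from that null are flagged. The function name, the SciPy binomial test, and the crude Bonferroni-style correction are illustrative choices here, not necessarily the paper's exact procedure.

```python
from collections import Counter
from scipy.stats import binomtest  # requires SciPy >= 1.7

def find_artifact_tokens(examples, alpha=0.01):
    """Flag tokens whose label distribution deviates from a 50/50 null.

    `examples` is an iterable of (tokens, label) pairs with binary labels
    in {0, 1} and a roughly balanced label distribution.
    """
    token_total = Counter()
    token_positive = Counter()
    for tokens, label in examples:
        for tok in set(tokens):            # count each token once per example
            token_total[tok] += 1
            token_positive[tok] += label

    # Crude multiple-testing correction over the number of tokens tested.
    threshold = alpha / max(len(token_total), 1)

    flagged = []
    for tok, n in token_total.items():
        k = token_positive[tok]
        p_value = binomtest(k, n, 0.5).pvalue   # null: token is uninformative
        if p_value < threshold:
            flagged.append((tok, k / n, n, p_value))
    return sorted(flagged, key=lambda r: r[-1])

# Toy usage; with only a handful of examples nothing reaches significance,
# but on a realistic corpus words like "amazing" typically surface here.
toy_data = [
    (["an", "amazing", "film"], 1),
    (["not", "amazing", "at", "all"], 0),
    (["amazing", "visuals", "dull", "plot"], 0),
]
print(find_artifact_tokens(toy_data, alpha=0.5))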
Related papers
- The Empirical Impact of Data Sanitization on Language Models [1.1359551336076306]
This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks.
Our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%.
For tasks such as comprehension Q&A, however, performance on redacted queries drops by more than 25% compared to the originals.
arXiv Detail & Related papers (2024-11-08T21:22:37Z)
- A Study on Bias Detection and Classification in Natural Language Processing [2.908482270923597]
The aim of our work is to determine how best to combine publicly available datasets to train models for hate speech detection and classification.
We discuss these issues alongside the development of our experiments, in which we show that combining different datasets greatly impacts the models' performance.
arXiv Detail & Related papers (2024-08-14T11:49:24Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models (a toy probe for such residual bias is sketched after this entry).
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
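The summary of the "Stubborn Lexical Bias" entry above does not specify its statistical method, but one crude way to look for residual lexical bias in a trained model is to score isolated words and see whether the output deviates from 0.5. The `model` callable below is a hypothetical wrapper around whatever classifier is being audited; this is a toy probe, not the paper's procedure.

```python
def probe_single_word_bias(model, vocab, threshold=0.1):
    """Check whether a trained binary classifier reacts to isolated words.

    `model` is a hypothetical callable mapping a string to P(label = positive).
    Under the competency-problem view a bare, context-free word should score
    near 0.5; larger deviations suggest the model absorbed a lexical artifact
    even if the training data was reweighted.
    """
    flagged = []
    for word in vocab:
        p = model(word)                        # score the bare word, no context
        if abs(p - 0.5) > threshold:
            flagged.append((word, p))
    return sorted(flagged, key=lambda r: abs(r[1] - 0.5), reverse=True)

# Hypothetical usage with a scikit-learn-style pipeline named `clf`:
# probe_single_word_bias(lambda s: clf.predict_proba([s])[0, 1], ["amazing", "dull"])
```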
- Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
- Less Learn Shortcut: Analyzing and Mitigating Learning of Spurious Feature-Label Correlation [44.319739489968164]
Deep neural networks often exploit dataset biases as a shortcut for making decisions rather than understanding the task.
In this study, we focus on the spurious correlation between word features and labels that models learn from the biased data distribution.
We propose a training strategy, Less-Learn-Shortcut (LLS), which quantifies the degree of bias in biased examples and down-weights them accordingly (a toy version of such down-weighting is sketched after this entry).
arXiv Detail & Related papers (2022-05-25T09:08:35Z)
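The LLS entry above describes down-weighting examples by their "biased degree", but the summary does not give the scoring formula. The sketch below uses an illustrative definition (how strongly an example's own words predict its own label) to produce per-example training weights; the name, smoothing, and floor parameters are assumptions, not the paper's exact method.

```python
from collections import Counter

def downweight_biased_examples(examples, smoothing=1.0, floor=0.1):
    """Assign smaller training weights to examples whose words already give
    away the label.

    `examples` is an iterable of (tokens, label) pairs with labels in {0, 1}.
    The "biased degree" here is an illustrative stand-in for the LLS
    definition; the returned values would be used as per-example loss weights.
    """
    total, positive = Counter(), Counter()
    for tokens, label in examples:
        for tok in set(tokens):
            total[tok] += 1
            positive[tok] += label

    weights = []
    for tokens, label in examples:
        degrees = []
        for tok in set(tokens):
            p_pos = (positive[tok] + smoothing) / (total[tok] + 2 * smoothing)
            p_own = p_pos if label == 1 else 1.0 - p_pos
            degrees.append(max(0.0, 2.0 * (p_own - 0.5)))  # 0 = uninformative, 1 = fully predictive
        biased_degree = sum(degrees) / len(degrees) if degrees else 0.0
        weights.append(max(floor, 1.0 - biased_degree))    # down-weight, never to zero
    return weights
```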
- Representation Bias in Data: A Survey on Identification and Resolution Techniques [26.142021257838564]
Data-driven algorithms are only as good as the data they work with, yet data sets, especially social data, often fail to represent minorities adequately.
Representation Bias in data can happen due to various reasons ranging from historical discrimination to selection and sampling biases in the data acquisition and preparation methods.
This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how the data is consumed later.
arXiv Detail & Related papers (2022-03-22T16:30:22Z)
- Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [66.15398165275926]
We propose a method that can automatically detect and ignore dataset-specific patterns, which we call dataset biases.
Our method trains a lower-capacity model in an ensemble with a higher-capacity model (a rough sketch of such an ensemble loss follows this entry).
We show improvement in all settings, including a 10-point gain on the visual question answering dataset.
arXiv Detail & Related papers (2020-11-07T22:20:03Z)
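For the mixed-capacity ensemble entry above, a common way to realize "train a low-capacity model together with a high-capacity model so the latter ignores the biases" is a product-of-experts style loss. The PyTorch sketch below shows that combination under the assumption that only the high-capacity model is kept at test time; the paper's exact objective may differ.

```python
import torch.nn.functional as F

def mixed_capacity_loss(low_logits, high_logits, labels):
    """Joint loss for a low-capacity / high-capacity classifier ensemble.

    The supervised signal is applied to the combined (product-of-experts
    style) prediction, so the low-capacity model can absorb simple
    dataset-specific patterns while the high-capacity model is pushed toward
    the remaining signal. At test time only the high-capacity model is kept.
    """
    combined = F.log_softmax(low_logits, dim=-1) + F.log_softmax(high_logits, dim=-1)
    # cross_entropy renormalizes `combined`, i.e. the product of the two
    # experts' distributions, against the gold labels.
    return F.cross_entropy(combined, labels)
```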
- Evaluating Factuality in Generation with Dependency-level Entailment [57.5316011554622]
We propose a new formulation of entailment that decomposes it at the level of dependency arcs.
We show that our dependency arc entailment model trained on this data can identify factual inconsistencies in paraphrasing and summarization better than sentence-level methods.
arXiv Detail & Related papers (2020-10-12T06:43:10Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)