Identifying Statistical Bias in Dataset Replication
- URL: http://arxiv.org/abs/2005.09619v2
- Date: Wed, 2 Sep 2020 06:38:04 GMT
- Title: Identifying Statistical Bias in Dataset Replication
- Authors: Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras,
Jacob Steinhardt, Aleksander Madry
- Abstract summary: We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy.
After correcting for the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for.
- Score: 102.92137353938388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset replication is a useful tool for assessing whether improvements in
test accuracy on a specific benchmark correspond to improvements in models'
ability to generalize reliably. In this work, we present unintuitive yet
significant ways in which standard approaches to dataset replication introduce
statistical bias, skewing the resulting observations. We study ImageNet-v2, a
replication of the ImageNet dataset on which models exhibit a significant
(11-14%) drop in accuracy, even after controlling for a standard
human-in-the-loop measure of data quality. We show that after correcting for
the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the
original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for. We conclude
with concrete recommendations for recognizing and avoiding bias in dataset
replication. Code for our study is publicly available at
http://github.com/MadryLab/dataset-replication-analysis .
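As a rough illustration of the kind of statistical bias at issue, the simulation below filters candidate images by a noisy, finite-annotator estimate of their selection frequency. This is a minimal sketch, not the authors' analysis: the annotator count, threshold, and frequency distribution are hypothetical. Because the estimate is noisy, images that clear the threshold have a systematically lower true selection frequency than their observed one, so the filtered pool is harder than the quality-control statistic suggests.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each candidate image has a true "selection frequency"
# (the probability that an annotator marks it as a valid example of its class).
n_images = 100_000
true_freq = rng.beta(4, 2, size=n_images)  # skewed toward easy images

# Each image is shown to a small number of annotators; the observed
# selection frequency is a noisy binomial estimate of the true one.
n_annotators = 10
observed_freq = rng.binomial(n_annotators, true_freq) / n_annotators

# Standard replication step: keep images whose *observed* frequency clears
# a quality threshold, mimicking a human-in-the-loop quality control.
threshold = 0.7
kept = observed_freq >= threshold

print(f"mean observed frequency of kept images: {observed_freq[kept].mean():.3f}")
print(f"mean true frequency of kept images:     {true_freq[kept].mean():.3f}")
# The true frequency of the kept pool is systematically lower than the
# observed one: thresholding a noisy estimate selects for favorable noise,
# so the replicated dataset ends up harder than the statistic suggests.
```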
Related papers
- TRIAGE: Characterizing and auditing training data for improved
regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility for improving performance via data sculpting/filtering in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z)
- Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
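The summary above gives only the high-level idea; below is a minimal PyTorch-style sketch of one way to "minimize confidence on an uncertainty dataset". The loss form (cross-entropy to the uniform distribution) and the lambda weighting are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def confidence_minimization_loss(model, x, y, x_uncertain, lam=1.0):
    """Standard cross-entropy on labeled data, plus a penalty that pushes
    predictions on an auxiliary 'uncertainty' dataset toward the uniform
    distribution (i.e., minimizes confidence there).
    A sketch of the general idea, not the paper's exact objective."""
    ce = F.cross_entropy(model(x), y)

    log_probs_u = F.log_softmax(model(x_uncertain), dim=-1)
    # Mean negative log-probability over all classes equals the cross-entropy
    # to a uniform target, which is minimized when predictions are maximally
    # uncertain.
    confidence_penalty = -log_probs_u.mean()

    return ce + lam * confidence_penalty
```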
arXiv Detail & Related papers (2023-06-08T07:05:36Z)
- A Principled Evaluation Protocol for Comparative Investigation of the Effectiveness of DNN Classification Models on Similar-but-non-identical Datasets [11.735794237408427]
We show that Deep Neural Network (DNN) models show significant, consistent, and largely unexplained degradation in accuracy on replication test datasets.
We propose a principled evaluation protocol that is suitable for performing comparative investigations of the accuracy of a DNN model on multiple test datasets.
Our experimental results indicate that the accuracy degradation observed between established benchmark datasets and their replications is consistently lower than previously reported.
arXiv Detail & Related papers (2022-09-05T09:14:43Z)
- Certifying Data-Bias Robustness in Linear Regression [12.00314910031517]
We present a technique for certifying whether linear regression models are pointwise-robust to label bias in a training dataset.
We show how to solve this problem exactly for individual test points, and provide an approximate but more scalable method.
We also unearth gaps in bias-robustness, such as high levels of non-robustness for certain bias assumptions on some datasets.
arXiv Detail & Related papers (2022-06-07T20:47:07Z)
- Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets [27.562256973255728]
Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on.
We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model.
Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations.
arXiv Detail & Related papers (2022-03-24T09:08:05Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
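A minimal numpy sketch of the thresholding idea, reconstructed from the abstract (function and variable names are illustrative): pick the threshold on labeled source data so that the fraction of examples scoring above it matches source accuracy, then report the corresponding fraction on the unlabeled target data.

```python
import numpy as np

def atc_predict_accuracy(source_conf, source_correct, target_conf):
    """Average Thresholded Confidence, sketched from the abstract:
    learn a confidence threshold on labeled source data such that the
    fraction of source examples above it equals source accuracy, then
    predict target accuracy as the fraction of unlabeled target examples
    whose confidence exceeds the same threshold."""
    source_acc = source_correct.mean()
    # The (1 - accuracy) quantile of source confidences: the fraction of
    # source examples above this value matches the source accuracy.
    threshold = np.quantile(source_conf, 1.0 - source_acc)
    return (target_conf > threshold).mean()

# Illustrative usage with max-softmax confidences (hypothetical arrays):
# est_acc = atc_predict_accuracy(val_conf, val_correct, test_conf)
```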
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- CrossAug: A Contrastive Data Augmentation Method for Debiasing Fact Verification Models [14.75693099720436]
We propose CrossAug, a contrastive data augmentation method for debiasing fact verification models.
We employ a two-stage augmentation pipeline to generate new claims and evidences from existing samples.
The generated samples are then paired cross-wise with the original pair, forming contrastive samples that encourage the model to rely less on spurious patterns.
arXiv Detail & Related papers (2021-09-30T13:19:19Z)
- Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest that future dataset creation include a simple model as a difficulty/bias probe, and that future model development use a clean, non-overlapping site and date split.
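As a rough sketch of the suggested clean split (column names and the date cutoff are hypothetical, not from the paper), one could hold out both the time period and the source sites:

```python
import pandas as pd

def clean_site_date_split(df, date_col="published_at", site_col="source_site",
                          cutoff="2020-01-01"):
    """Split articles so that train and test share no source sites and no
    time period: train on articles from before the cutoff, test on articles
    from after it, and drop any test articles whose site appears in train.
    Column names and the cutoff are illustrative, not from the paper."""
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col])

    train = df[df[date_col] < pd.Timestamp(cutoff)]
    test = df[df[date_col] >= pd.Timestamp(cutoff)]
    test = test[~test[site_col].isin(set(train[site_col]))]
    return train, test
```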
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
- Are Labels Always Necessary for Classifier Accuracy Evaluation? [28.110519483540482]
We aim to estimate the classification accuracy on unlabeled test datasets.
We construct a meta-dataset composed of datasets generated from the original images.
As the classification accuracy of the model on each sample (dataset) is known from the original dataset labels, our task can be solved via regression.
arXiv Detail & Related papers (2020-07-06T17:45:39Z)
- Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift [81.74795324629712]
We evaluate replacing the batch normalization statistics estimated during training with statistics computed on the test batch, a method we call prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z)
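A minimal sketch of prediction-time batch normalization as described in the entry above, assuming a PyTorch model with BatchNorm layers: normalize with statistics computed on the test batch instead of the running statistics estimated during training. The momentum save/restore is an implementation detail to avoid overwriting the stored running statistics.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def predict_with_test_batch_stats(model, x):
    """Forward pass that normalizes with statistics of the test batch itself
    rather than the running statistics estimated during training. A sketch of
    the idea, not the paper's exact procedure."""
    was_training = model.training
    model.eval()

    bn_layers = [m for m in model.modules()
                 if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))]
    saved_momentum = [m.momentum for m in bn_layers]
    for m in bn_layers:
        m.train()         # train mode => normalize with batch statistics
        m.momentum = 0.0  # keep running_mean / running_var untouched

    try:
        out = model(x)
    finally:
        for m, momentum in zip(bn_layers, saved_momentum):
            m.momentum = momentum
        model.train(was_training)  # restore original train/eval state
    return out
```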
This list is automatically generated from the titles and abstracts of the papers in this site.