A Decade's Battle on Dataset Bias: Are We There Yet?
- URL: http://arxiv.org/abs/2403.08632v1
- Date: Wed, 13 Mar 2024 15:46:37 GMT
- Title: A Decade's Battle on Dataset Bias: Are We There Yet?
- Authors: Zhuang Liu, Kaiming He
- Abstract summary: We revisit the "dataset classification" experiment suggested by Torralba and Efros a decade ago.
Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We revisit the "dataset classification" experiment suggested by Torralba and
Efros a decade ago, in the new era with large-scale, diverse, and hopefully
less biased datasets as well as more capable neural network architectures.
Surprisingly, we observe that modern neural networks can achieve excellent
accuracy in classifying which dataset an image is from: e.g., we report 84.7%
accuracy on held-out validation data for the three-way classification problem
consisting of the YFCC, CC, and DataComp datasets. Our further experiments show
that such a dataset classifier could learn semantic features that are
generalizable and transferable, which cannot be simply explained by
memorization. We hope our discovery will inspire the community to rethink the
issue of dataset bias and the capabilities of models.
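The headline experiment in the abstract is simple to caricature: label each image with the dataset it came from, train a classifier on those labels, and measure held-out accuracy. Below is a minimal self-contained sketch in that spirit, using synthetic Gaussian "datasets" and a nearest-centroid classifier in place of real images and a modern network; the means, dimensions, and sample counts are illustrative assumptions, not values from the paper.

```python
import random

random.seed(0)

# Toy stand-ins for three sources (e.g. YFCC / CC / DataComp): each "dataset"
# draws 16-dim feature vectors around its own mean, mimicking the
# dataset-specific statistics a real classifier would pick up on.
MEANS = [0.0, 0.5, 1.0]
DIM = 16

def sample(mean, n):
    return [[random.gauss(mean, 1.0) for _ in range(DIM)] for _ in range(n)]

train = [(x, k) for k, m in enumerate(MEANS) for x in sample(m, 200)]
val = [(x, k) for k, m in enumerate(MEANS) for x in sample(m, 100)]

# Nearest-centroid "dataset classifier": predict which source a vector is from.
centroids = []
for k in range(len(MEANS)):
    xs = [x for x, label in train if label == k]
    centroids.append([sum(col) / len(xs) for col in zip(*xs)])

def predict(x):
    dist2 = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(centroids)), key=lambda k: dist2(centroids[k]))

acc = sum(predict(x) == label for x, label in val) / len(val)
print(f"held-out 3-way accuracy: {acc:.3f}")
```

Even this trivial classifier beats the 1/3 chance level once the toy sources differ statistically, which is the qualitative point of the experiment; it is not a reproduction of the reported 84.7%.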
Related papers
- Fuzzy Convolution Neural Networks for Tabular Data Classification [0.0]
Convolutional neural networks (CNNs) have attracted a great deal of attention due to their remarkable performance in various domains.
In this paper, we propose a novel framework fuzzy convolution neural network (FCNN) tailored specifically for tabular data.
arXiv Detail & Related papers (2024-06-04T20:33:35Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- UnbiasedNets: A Dataset Diversification Framework for Robustness Bias Alleviation in Neural Networks [11.98126285848966]
Even the most accurate NNs can be biased toward a specific output class due to the inherent bias in the available training datasets.
This paper deals with the robustness bias, i.e., the bias exhibited by the trained NN by having a significantly large robustness to noise for a certain output class.
We propose the UnbiasedNets framework, which leverages K-means clustering and the NN's noise tolerance to diversify the given training dataset.
arXiv Detail & Related papers (2023-02-24T09:49:43Z)
- Multi-layer Representation Learning for Robust OOD Image Classification [3.1372269816123994]
We argue that extracting features from a CNN's intermediate layers can assist in the model's final prediction.
Specifically, we adapt the Hypercolumns method to a ResNet-18 and find a significant increase in the model's accuracy when evaluating on the NICO dataset.
arXiv Detail & Related papers (2022-07-27T17:46:06Z)
- Do We Really Need a Learnable Classifier at the End of Deep Neural Network? [118.18554882199676]
We study the potential of learning a neural network for classification with the classifier randomly initialized as a simplex equiangular tight frame (ETF) and fixed during training.
Our experimental results show that our method is able to achieve similar performances on image classification for balanced datasets.
arXiv Detail & Related papers (2022-03-17T04:34:28Z)
- CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems, showing that CvS achieves much higher classification accuracy than previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z)
- Does Data Repair Lead to Fair Models? Curating Contextually Fair Data To Reduce Model Bias [10.639605996067534]
Contextual information is a valuable cue for Deep Neural Networks (DNNs) to learn better representations and improve accuracy.
In COCO, many object categories have a much higher co-occurrence with men compared to women, which can bias a DNN's prediction in favor of men.
We introduce a data repair algorithm using the coefficient of variation, which can curate fair and contextually balanced data for a protected class.
arXiv Detail & Related papers (2021-10-20T06:00:03Z) - Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observe a significant drop (>10%) in accuracy for all models tested on a clean split with no train/test source overlap.
We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
arXiv Detail & Related papers (2021-04-20T17:16:41Z) - A Note on Data Biases in Generative Models [16.86600007830682]
We investigate the impact of dataset quality on the performance of generative models.
We show how societal biases of datasets are replicated by generative models.
We present creative applications through unpaired transfer between diverse datasets such as photographs, oil portraits, and animes.
arXiv Detail & Related papers (2020-12-04T10:46:37Z) - Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [66.15398165275926]
We propose a method that can automatically detect and ignore dataset-specific patterns, which we call dataset biases.
Our method trains a lower capacity model in an ensemble with a higher capacity model.
We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
arXiv Detail & Related papers (2020-11-07T22:20:03Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
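Among the entries above, the Hidden Biases paper's recommendation of a clean, non-overlapping site-and-date split is concrete enough to sketch. A minimal illustration follows; the `site`/`date` record fields and the rule of dropping test articles whose site also appears in training are assumptions for illustration, not the authors' exact procedure.

```python
from datetime import date

def clean_split(articles, cutoff):
    """Split articles so train and test share no date range and no source site.

    Articles dated before `cutoff` form the train set; the rest form the test
    set, minus any article whose site also appears in the train set.
    """
    train = [a for a in articles if a["date"] < cutoff]
    train_sites = {a["site"] for a in train}
    test = [a for a in articles
            if a["date"] >= cutoff and a["site"] not in train_sites]
    return train, test

train_set, test_set = clean_split(
    [{"site": "a.example", "date": date(2020, 1, 1)},
     {"site": "b.example", "date": date(2021, 6, 1)},
     {"site": "a.example", "date": date(2021, 6, 2)}],
    cutoff=date(2021, 1, 1),
)
print(len(train_set), len(test_set))  # prints: 1 1
```

The second `a.example` article is discarded rather than placed in either split, trading data volume for the guarantee that no model can exploit source identity across the train/test boundary.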
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.