Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases
- URL: http://arxiv.org/abs/2303.05470v3
- Date: Mon, 12 Jun 2023 14:04:53 GMT
- Title: Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases
- Authors: Aengus Lynch, Gbètondji J-S Dovonon, Jean Kaddour, Ricardo Silva
- Abstract summary: We present Spawrious-{O2O, M2M}-{Easy, Medium, Hard}, an image classification benchmark suite containing spurious correlations between classes and backgrounds.
The resulting dataset is of high quality and contains approximately 152k images.
- Score: 8.455991178281469
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problem of spurious correlations (SCs) arises when a classifier relies on
non-predictive features that happen to be correlated with the labels in the
training data. For example, a classifier may misclassify dog breeds based on
the backgrounds of dog images: if particular backgrounds are correlated with
particular breeds in the training data, the model can rely on the background and
misclassify dogs that appear on unexpected backgrounds at test time. Previous SC
benchmark datasets suffer from varying issues, e.g.,
over-saturation or only containing one-to-one (O2O) SCs, but no many-to-many
(M2M) SCs arising between groups of spurious attributes and classes. In this
paper, we present Spawrious-{O2O, M2M}-{Easy, Medium, Hard}, an image
classification benchmark suite containing spurious correlations between classes
and backgrounds. To create this dataset, we employ a text-to-image model to
generate photo-realistic images and an image captioning model to filter out
unsuitable ones. The resulting dataset is of high quality and contains
approximately 152k images. Our experimental results demonstrate that
state-of-the-art group robustness methods struggle with Spawrious, most
notably on the Hard-splits, with none of them getting over 70% accuracy on
the hardest split using a ResNet50 pretrained on ImageNet. By examining model
misclassifications, we detect reliances on spurious backgrounds, demonstrating
that our dataset provides a significant challenge.
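To make the generation-and-filtering recipe above concrete, below is a minimal sketch assuming Stable Diffusion (via the diffusers library) as the text-to-image model and BLIP (via transformers) as the captioning filter; the prompt, keyword check, and model checkpoints are illustrative assumptions, not the exact Spawrious pipeline.

```python
# Minimal sketch of a generate-then-filter loop: a text-to-image model proposes an
# image for a (class, background) pair, and a captioning model checks that both are
# visible. Model choices, prompt, and keywords are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image model: generate one candidate image.
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
prompt = "a photo of a dachshund in a snowy mountain landscape"  # hypothetical prompt
image = t2i(prompt).images[0]

# Captioning model: describe the generated image.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)
inputs = processor(images=image, return_tensors="pt").to(device)
caption = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)

# Keep the image only if the caption mentions the intended class and background.
keep = "dog" in caption.lower() and any(w in caption.lower() for w in ("snow", "mountain"))
print(caption, "->", "kept" if keep else "discarded")
```

In practice, such a loop would iterate over every class-background combination and a target image count per split to assemble the roughly 152k images of the benchmark.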
Related papers
- Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation [2.273629240935727]
We propose Decompose-and-Compose (DaC) to improve robustness against correlation shift by combining elements of images.
Based on our observations, models trained with Empirical Risk Minimization (ERM) usually highly attend to either the causal components or the components having a high spurious correlation with the label.
We propose a group-balancing method by intervening on images without requiring group labels or information regarding the spurious features during training.
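As a hedged sketch of the compose step, the snippet below mixes the (assumed) causal region of one image with another image using a binary mask; how DaC actually identifies causal components is not reproduced here, and the mask is supplied by hand.

```python
# Hedged sketch of composing two images: keep an (assumed) causal region from one
# image and fill the rest from another image. How the causal region is found is
# not reproduced here; the mask is a hypothetical hand-picked rectangle.
import torch

def compose(causal_img, other_img, causal_mask):
    """causal_mask is 1 where the causal component lives, 0 elsewhere (broadcastable to C x H x W)."""
    return causal_mask * causal_img + (1 - causal_mask) * other_img

img_a = torch.rand(3, 224, 224)           # image providing the causal component
img_b = torch.rand(3, 224, 224)           # image providing the background / spurious context
mask = torch.zeros(1, 224, 224)
mask[:, 64:160, 64:160] = 1.0             # hypothetical causal region
combined = compose(img_a, img_b, mask)    # the label of img_a is kept for the new sample
```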
arXiv Detail & Related papers (2024-02-29T07:24:24Z)
- Common-Sense Bias Discovery and Mitigation for Classification Tasks [16.8259488742528]
We propose a framework to extract feature clusters in a dataset based on image descriptions.
The analyzed features and correlations are human-interpretable, so we name the method Common-Sense Bias Discovery (CSBD).
Experiments show that our method discovers novel biases on multiple classification tasks for two benchmark image datasets.
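A very rough stand-in for the description-based feature clustering above is sketched below, using TF-IDF and k-means over captions and reporting how each caption cluster co-occurs with the class labels; the embedding, clustering, and correlation choices are assumptions, not the CSBD implementation.

```python
# Very rough stand-in for clustering caption-derived features and checking their
# co-occurrence with labels; captions and labels here are toy examples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

captions = [
    "a small dog on a snowy mountain",
    "a dog running on a beach",
    "a cat sitting on a sofa indoors",
    "a cat on a snowy street",
]
labels = np.array([0, 0, 1, 1])  # hypothetical class labels

# Cluster captions into candidate feature groups (here via TF-IDF + k-means).
features = TfidfVectorizer().fit_transform(captions)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Report how strongly each caption cluster co-occurs with each class label.
for c in np.unique(clusters):
    frac = (labels[clusters == c] == 1).mean()
    print(f"cluster {c}: fraction of class-1 images = {frac:.2f}")
```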
arXiv Detail & Related papers (2024-01-24T03:56:07Z)
- Diverse, Difficult, and Odd Instances (D2O): A New Test Set for Object Classification [47.64219291655723]
We introduce a new test set, called D2O, which is sufficiently different from existing test sets.
Our dataset contains 8,060 images spread across 36 categories, out of which 29 appear in ImageNet.
The best Top-1 accuracy on our dataset is around 60%, which is much lower than the best Top-1 accuracy of 91% on ImageNet.
arXiv Detail & Related papers (2023-01-29T19:58:32Z)
- Better May Not Be Fairer: A Study on Subgroup Discrepancy in Image Classification [73.87160347728314]
We investigate how natural background colors play a role as spurious features by annotating the test sets of CIFAR10 and CIFAR100 into subgroups based on the background color of each image.
We find that overall human-level accuracy does not guarantee consistent subgroup performances, and the phenomenon remains even on models pre-trained on ImageNet or after data augmentation (DA).
Experimental results show that the proposed FlowAug augmentation achieves more consistent subgroup results than other types of DA methods on CIFAR10/100 and on CIFAR10/100-C.
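The subgroup evaluation described above can be sketched as follows, assuming background-color subgroup labels are available as an integer array alongside predictions and targets; the names and the toy example are hypothetical.

```python
# Minimal sketch of per-subgroup accuracy given background-color subgroup labels.
# Variable names and the toy example are illustrative, not from the paper's code.
import numpy as np

def subgroup_accuracies(preds, targets, subgroups):
    """Return overall accuracy, per-subgroup accuracy, and worst-case subgroup accuracy."""
    preds, targets, subgroups = map(np.asarray, (preds, targets, subgroups))
    per_group = {
        int(g): float((preds[subgroups == g] == targets[subgroups == g]).mean())
        for g in np.unique(subgroups)
    }
    return float((preds == targets).mean()), per_group, min(per_group.values())

# Toy example: decent overall accuracy can hide a subgroup on which the model fails.
overall, per_group, worst = subgroup_accuracies(
    preds=[0, 0, 1, 1, 1, 0], targets=[0, 0, 1, 1, 0, 1], subgroups=[0, 0, 0, 0, 1, 1]
)
print(overall, per_group, worst)
```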
arXiv Detail & Related papers (2022-12-16T18:51:10Z)
- Invariant Learning via Diffusion Dreamed Distribution Shifts [121.71383835729848]
We propose a dataset called Diffusion Dreamed Distribution Shifts (D3S).
D3S consists of synthetic images generated through StableDiffusion using text prompts and image guides obtained by pasting a sample foreground image onto a background template image.
Due to the incredible photorealism of the diffusion model, our images are much closer to natural images than those in previous synthetic datasets.
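A minimal sketch of the image-guide idea is given below: paste a foreground onto a background template with PIL, then render with an img2img diffusion pipeline. The file paths, prompt, and strength value are illustrative assumptions, not taken from D3S.

```python
# Minimal sketch: paste a foreground onto a background template to form an image
# guide, then let an img2img diffusion pipeline render a photorealistic sample.
# Paths, prompt, and strength are illustrative assumptions, not D3S settings.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Build the image guide by pasting a foreground (with alpha mask) onto a background.
background = Image.open("background_template.jpg").convert("RGB").resize((512, 512))
foreground = Image.open("foreground_sample.png").convert("RGBA")  # transparent outside object
guide = background.copy()
guide.paste(foreground, (128, 200), foreground)  # alpha channel acts as the paste mask

# Render a natural-looking image conditioned on the guide and a text prompt.
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
result = pipe(prompt="a photo of a dog in a park", image=guide, strength=0.6).images[0]
result.save("d3s_style_sample.jpg")
```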
arXiv Detail & Related papers (2022-11-18T17:07:43Z)
- Learning to Annotate Part Segmentation with Gradient Matching [58.100715754135685]
This paper focuses on tackling semi-supervised part segmentation tasks by generating high-quality images with a pre-trained GAN.
In particular, we formulate the annotator learning as a learning-to-learn problem.
We show that our method can learn annotators from a broad range of labelled images including real images, generated images, and even analytically rendered images.
arXiv Detail & Related papers (2022-11-06T01:29:22Z)
- Understanding out-of-distribution accuracies through quantifying difficulty of test samples [10.266928164137635]
Existing works show that although modern neural networks achieve remarkable generalization performance on the in-distribution (ID) dataset, the accuracy drops significantly on the out-of-distribution (OOD) datasets.
We propose a new metric to quantify the difficulty of the test images (either ID or OOD) that depends on the interaction of the training dataset and the model.
arXiv Detail & Related papers (2022-03-28T21:13:41Z)
- Free Lunch for Co-Saliency Detection: Context Adjustment [14.688461235328306]
We propose a "cost-free" group-cut-paste (GCP) procedure to leverage images from off-the-shelf saliency detection datasets and synthesize new samples.
We collect a novel dataset called Context Adjustment Training. The two variants of our dataset, i.e., CAT and CAT+, consist of 16,750 and 33,500 images, respectively.
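In the spirit of the group-cut-paste procedure, the sketch below cuts a salient object out of one image using its saliency mask and pastes it into a new context; the file names and paste position are hypothetical, and the paper's grouping and quality checks are omitted.

```python
# Minimal cut-and-paste sketch in the spirit of GCP, assuming each source image
# comes with a binary saliency mask (white = salient object). Paths are hypothetical.
from PIL import Image

def cut_paste(fg_path, mask_path, bg_path, position=(0, 0)):
    """Cut the salient object out of fg using its mask and paste it onto bg."""
    fg = Image.open(fg_path).convert("RGBA")
    mask = Image.open(mask_path).convert("L")   # same size as fg
    bg = Image.open(bg_path).convert("RGBA")
    bg.paste(fg, position, mask)                # mask hides non-salient pixels
    return bg.convert("RGB")

sample = cut_paste("salient_object.jpg", "saliency_mask.png", "context_image.jpg", (50, 80))
sample.save("synthesized_sample.jpg")
```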
arXiv Detail & Related papers (2021-08-04T14:51:37Z)
- Background Splitting: Finding Rare Classes in a Sea of Background [55.03789745276442]
We focus on the real-world problem of training accurate deep models for image classification of a small number of rare categories.
In these scenarios, almost all images belong to the background category in the dataset (>95% of the dataset is background).
We demonstrate that both standard fine-tuning approaches and state-of-the-art approaches for training on imbalanced datasets do not produce accurate deep models in the presence of this extreme imbalance.
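For concreteness, the sketch below sets up the kind of extreme imbalance described above together with a standard inverse-frequency class-weighted loss, i.e., one of the imbalance baselines the paper reports as insufficient on its own; the counts are hypothetical.

```python
# Minimal sketch of the extreme-imbalance setup: one dominant "background" class
# plus a few rare categories, with a standard class-weighted loss of the kind the
# paper reports as insufficient on its own. Counts are hypothetical.
import torch
import torch.nn as nn

# Hypothetical label counts: index 0 is the background class (>95% of images).
class_counts = torch.tensor([96_000, 1_500, 1_200, 800, 500], dtype=torch.float)
frac = (class_counts[0] / class_counts.sum()).item()
print(f"background fraction: {frac:.2%}")

# Inverse-frequency class weights, normalized so the average weight is 1.
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

# Usage with any classifier's logits.
logits = torch.randn(8, len(class_counts))
targets = torch.randint(0, len(class_counts), (8,))
print(criterion(logits, targets).item())
```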
arXiv Detail & Related papers (2020-08-28T23:05:15Z)
- I Am Going MAD: Maximum Discrepancy Competition for Comparing Classifiers Adaptively [135.7695909882746]
We propose an adaptive framework for comparing classifiers, which we name the MAximum Discrepancy (MAD) competition.
We adaptively sample a small test set from an arbitrarily large corpus of unlabeled images.
Human labeling on the resulting model-dependent image sets reveals the relative performance of the competing classifiers.
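The core sampling idea can be sketched as below: given two classifiers' predictions over an unlabeled corpus, keep images on which they disagree and forward those for human labeling. The hard-disagreement rule and variable names are simplifying assumptions rather than the paper's exact procedure.

```python
# Simplified discrepancy-based test-set sampling: pick corpus indices where two
# classifiers' predicted labels differ, to be sent for human labeling.
import numpy as np

def select_disagreements(preds_a, preds_b, k):
    """Return up to k corpus indices where the two classifiers disagree."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    disagree = np.flatnonzero(preds_a != preds_b)
    return disagree[:k]  # in practice one might rank by confidence gap instead

idx = select_disagreements([0, 1, 2, 1, 0], [0, 2, 2, 0, 0], k=2)
print(idx)  # indices forwarded for human labeling
```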
arXiv Detail & Related papers (2020-02-25T03:32:29Z)