On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations
- URL: http://arxiv.org/abs/2204.12708v1
- Date: Wed, 27 Apr 2022 05:42:40 GMT
- Title: On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations
- Authors: Roy Schwartz and Gabriel Stanovsky
- Abstract summary: Deep learning models are sensitive to low-level correlations between simple features and specific output labels.
To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances.
However, increasingly powerful models keep exploiting ever-smaller spurious correlations, so even balancing all single-word features is insufficient to mitigate them.
- Score: 17.709208772225512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that deep learning models in NLP are highly sensitive
to low-level correlations between simple features and specific output labels,
leading to overfitting and lack of generalization. To mitigate this problem, a
common practice is to balance datasets by adding new instances or by filtering
out "easy" instances (Sakaguchi et al., 2020), culminating in a recent proposal
to eliminate single-word correlations altogether (Gardner et al., 2021). In
this opinion paper, we identify that despite these efforts,
increasingly-powerful models keep exploiting ever-smaller spurious
correlations, and as a result even balancing all single-word features is
insufficient for mitigating all of these correlations. In parallel, a truly
balanced dataset may be bound to "throw the baby out with the bathwater" and
miss important signal encoding common sense and world knowledge. We highlight
several alternatives to dataset balancing, focusing on enhancing datasets with
richer contexts, allowing models to abstain and interact with users, and
turning from large-scale fine-tuning to zero- or few-shot setups.
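To make the single-word correlations discussed above concrete, the following is a minimal sketch (not the authors' code) of how such correlations can be flagged: for each word, the label rate among instances containing it is compared to the dataset-wide base rate with a simple z-statistic, loosely in the spirit of the analysis in Gardner et al. (2021). The function name, thresholds, and toy data are illustrative assumptions, and a binary-label text classification dataset is assumed.

```python
# Minimal sketch (not the authors' code): flag single-word features whose label
# distribution deviates strongly from the dataset base rate, loosely in the spirit
# of the competency-problems analysis of Gardner et al. (2021). Assumes a
# binary-label text classification dataset; names and thresholds are illustrative.
import math
from collections import Counter

def flag_spurious_words(texts, labels, z_threshold=3.0, min_count=20):
    """Return words whose P(label=1 | word present) deviates from the base rate
    by more than z_threshold standard errors (a plain two-sided z-test)."""
    base_rate = sum(labels) / len(labels)           # assumes both labels occur
    word_total, word_pos = Counter(), Counter()
    for text, label in zip(texts, labels):
        for word in set(text.lower().split()):      # count each word once per instance
            word_total[word] += 1
            word_pos[word] += label
    flagged = {}
    for word, n in word_total.items():
        if n < min_count:
            continue                                # too rare to test reliably
        p_hat = word_pos[word] / n
        se = math.sqrt(base_rate * (1 - base_rate) / n)
        z = (p_hat - base_rate) / se
        if abs(z) > z_threshold:
            flagged[word] = round(z, 2)
    return flagged

# Toy usage: "no" ends up strongly associated with label 0.
texts = ["there is no tree", "no dog here", "a cat sits", "a dog runs"] * 10
labels = [0, 0, 1, 1] * 10
print(flag_spurious_words(texts, labels, z_threshold=2.0, min_count=5))
```

A balancing procedure in the sense discussed above would then add or filter instances until no word is flagged; the paper's argument is that models can still exploit combinations of features that such single-word tests never examine.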
Related papers
- Autoencoder based approach for the mitigation of spurious correlations [2.7624021966289605]
Spurious correlations refer to erroneous associations in data that do not reflect true underlying relationships.
These correlations can lead deep neural networks (DNNs) to learn patterns that are not robust across diverse datasets or real-world scenarios.
We propose an autoencoder-based approach to analyze the nature of spurious correlations that exist in the Global Wheat Head Detection (GWHD) 2021 dataset.
arXiv Detail & Related papers (2024-06-27T05:28:44Z)
- Unsupervised Concept Discovery Mitigates Spurious Correlations [45.48778210340187]
Models prone to spurious correlations in training data often produce brittle predictions and introduce unintended biases.
In this paper, we establish a novel connection between unsupervised object-centric learning and mitigation of spurious correlations.
We introduce CoBalT: a concept balancing technique that effectively mitigates spurious correlations without requiring human labeling of subgroups.
arXiv Detail & Related papers (2024-02-20T20:48:00Z)
- Data Factors for Better Compositional Generalization [60.698130703909804]
We conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors.
We show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges.
We explore how training examples of different difficulty levels influence generalization differently.
arXiv Detail & Related papers (2023-11-08T01:27:34Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Towards Mitigating more Challenging Spurious Correlations: A Benchmark & New Datasets [43.64631697043496]
Deep neural networks often exploit non-predictive features that are spuriously correlated with class labels.
Despite the growing body of recent works on remedying spurious correlations, the lack of a standardized benchmark hinders reproducible evaluation.
We present SpuCo, a Python package with modular implementations of state-of-the-art solutions, enabling easy and reproducible evaluation.
arXiv Detail & Related papers (2023-06-21T00:59:06Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Towards Integration of Discriminability and Robustness for Document-Level Relation Extraction [41.51148745387936]
Document-level relation extraction (DocRE) predicts relations for entity pairs that rely on long-range context-dependent reasoning in a document.
In this work, we aim to achieve better integration of both the discriminability and robustness for the DocRE problem.
We customize entropy minimization and supervised contrastive learning for the challenging multi-label and long-tailed learning problems.
arXiv Detail & Related papers (2023-04-03T09:11:18Z)
- Pipelined correlated minimum weight perfect matching of the surface code [56.01788646782563]
We describe a pipeline approach to decoding the surface code using minimum weight perfect matching.
An independent no-communication parallelizable processing stage reweights the graph according to likely correlations.
A later general stage finishes the matching.
We validate the new algorithm on the fully fault-tolerant toric, unrotated, and rotated surface codes.
arXiv Detail & Related papers (2022-05-19T19:58:02Z)
- Disentanglement and Generalization Under Correlation Shifts [22.499106910581958]
Correlations between factors of variation are prevalent in real-world data.
Machine learning algorithms may benefit from exploiting such correlations, as they can increase predictive performance on noisy data.
We aim to learn representations which capture different factors of variation in latent subspaces.
arXiv Detail & Related papers (2021-12-29T18:55:17Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [66.15398165275926]
We propose a method that can automatically detect and ignore dataset-specific patterns, which we call dataset biases.
Our method trains a lower-capacity model in an ensemble with a higher-capacity model (a minimal sketch of this recipe follows the list).
We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
arXiv Detail & Related papers (2020-11-07T22:20:03Z)
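The final entry above states only the high-level recipe of training a lower-capacity model alongside a higher-capacity one. The sketch below is a simplified illustration of that recipe, not the paper's implementation, and it omits the paper's refinements; PyTorch, a bag-of-words classifier, and all sizes and toy data are assumptions. The two models are combined product-of-experts style during training, and only the higher-capacity model is used at prediction time.

```python
# Minimal sketch (assumptions: PyTorch, bag-of-words inputs, illustrative sizes) of a
# low/high-capacity ensemble: both models are trained jointly through a
# product-of-experts combination; only the high-capacity model is kept at test time.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, CLASSES = 1000, 256, 3  # illustrative sizes

low_capacity = nn.Linear(VOCAB, CLASSES)      # intended to absorb shallow, dataset-specific patterns
high_capacity = nn.Sequential(                # intended to learn the remaining signal
    nn.Linear(VOCAB, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, CLASSES)
)
optimizer = torch.optim.Adam(
    list(low_capacity.parameters()) + list(high_capacity.parameters()), lr=1e-3
)

def train_step(x, y):
    """One joint update; summing the logits sums the two models' evidence
    (equivalent to a product of experts after the softmax)."""
    joint_logits = low_capacity(x) + high_capacity(x)
    loss = F.cross_entropy(joint_logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def predict(x):
    """At test time the low-capacity model is dropped, so its shortcuts are ignored."""
    with torch.no_grad():
        return high_capacity(x).argmax(dim=-1)

# Toy usage with random bag-of-words features.
x = torch.rand(8, VOCAB)
y = torch.randint(0, CLASSES, (8,))
print(train_step(x, y), predict(x))
```

The design intuition is that the low-capacity model soaks up shallow dataset-specific shortcuts during joint training, reducing the pressure on the high-capacity model to learn them; discarding the low-capacity model at test time then discards those shortcuts.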
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences arising from its use.