CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
- URL: http://arxiv.org/abs/2403.04547v1
- Date: Thu, 7 Mar 2024 14:43:17 GMT
- Title: CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
- Authors: Ibrahim Alabdulmohsin, Xiao Wang, Andreas Steiner, Priya Goyal,
Alexander D'Amour, Xiaohua Zhai
- Abstract summary: We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP).
We present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases.
- Score: 72.19502317793133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the effectiveness of data-balancing for mitigating biases in
contrastive language-image pretraining (CLIP), identifying areas of strength
and limitation. First, we reaffirm prior conclusions that CLIP models can
inadvertently absorb societal stereotypes. To counter this, we present a novel
algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both
representation and association biases (i.e. in first- and second-order
statistics) in multimodal data. We use M4 to conduct an in-depth analysis
taking into account various factors, such as the model, representation, and
data size. Our study also explores the dynamic nature of how CLIP learns and
unlearns biases. In particular, we find that fine-tuning is effective in
countering representation biases, though its impact diminishes for association
biases. Also, data balancing has a mixed impact on quality: it tends to improve
classification but can hurt retrieval. Interestingly, data and architectural
improvements seem to mitigate the negative impact of data balancing on
performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves
COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and
ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with
recommendations for improving the efficacy of data balancing in multimodal
systems.
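The abstract describes M4 only at a high level, but the idea of balancing first-order (representation) and second-order (association) statistics can be conveyed with a small reweighting sketch. The Python below is a hypothetical illustration under simplifying assumptions, not the paper's M4 implementation: it assigns per-example weights so that the weighted frequency of a binary sensitive attribute matches a target and its weighted covariance with a binary concept label is driven toward zero, using a simple exponentiated-gradient update. The function name `balance_weights`, the targets, and the solver are all invented for illustration.

```python
import numpy as np

def balance_weights(s, y, target_marginal=0.5, n_steps=2000, lr=0.5):
    """Toy moment-matching reweighting (illustrative, not the paper's M4).

    s: (n,) 0/1 sensitive-attribute indicator per example
    y: (n,) 0/1 concept indicator per example
    target_marginal: desired weighted frequency of s (first moment)

    The second-moment goal here is zero weighted covariance between s and y,
    i.e. attribute and concept become uncorrelated under the weights.
    """
    n = len(s)
    w = np.full(n, 1.0 / n)  # start from uniform weights over examples

    for _ in range(n_steps):
        mean_s = w @ s
        mean_y = w @ y
        cov_sy = w @ (s * y) - mean_s * mean_y

        # Gradient of (mean_s - target)^2 + cov_sy^2 with respect to w.
        grad = (2 * (mean_s - target_marginal) * s
                + 2 * cov_sy * (s * y - mean_y * s - mean_s * y))

        # Exponentiated-gradient step keeps the weights positive and normalized.
        w = w * np.exp(-lr * grad)
        w = w / w.sum()
    return w

# Hypothetical usage on synthetic attribute/concept flags.
rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=1000)                   # e.g. a perceived-gender flag
y = (rng.random(1000) < 0.3 + 0.4 * s).astype(int)  # concept correlated with s
w = balance_weights(s, y)
print("weighted P(s=1):", w @ s)
print("weighted cov(s, y):", w @ (s * y) - (w @ s) * (w @ y))
```

The paper's M4 handles many attributes and concepts jointly at web scale; this sketch only conveys the moment-matching intuition behind balancing representation and association biases.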
Related papers
- Conformal-in-the-Loop for Learning with Imbalanced Noisy Data [5.69777817429044]
Class imbalance and label noise are pervasive in large-scale datasets.
Much of machine learning research assumes well-labeled, balanced data, which rarely reflects real-world conditions.
We propose Conformal-in-the-Loop (CitL), a novel training framework that addresses both challenges with a conformal prediction-based approach.
arXiv Detail & Related papers (2024-11-04T17:09:58Z)
- Safe Semi-Supervised Contrastive Learning Using In-Distribution Data as Positive Examples [3.4546761246181696]
We propose a self-supervised contrastive learning approach to fully exploit a large amount of unlabeled data.
The results show that self-supervised contrastive learning significantly improves classification accuracy.
arXiv Detail & Related papers (2024-08-03T22:33:13Z)
- Understanding the Detrimental Class-level Effects of Data Augmentation [63.1733767714073]
Achieving optimal average accuracy with data augmentation (DA) can come at the cost of hurting individual class accuracy by as much as 20% on ImageNet.
We present a framework for understanding how DA interacts with class-level learning dynamics.
We show that simple class-conditional augmentation strategies improve performance on the negatively affected classes.
arXiv Detail & Related papers (2023-12-07T18:37:43Z)
- Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction (a minimal sketch of this idea appears after the Related papers list).
We show that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- FITNESS: A Causal De-correlation Approach for Mitigating Bias in Machine Learning Software [6.4073906779537095]
Biased datasets can lead to unfair and potentially harmful outcomes.
In this paper, we propose a bias mitigation approach via de-correlating the causal effects between sensitive features and the label.
Our key idea is that by de-correlating such effects from a causality perspective, the model would avoid making predictions based on sensitive features.
arXiv Detail & Related papers (2023-05-23T06:24:43Z)
- Cross Pairwise Ranking for Unbiased Item Recommendation [57.71258289870123]
We develop a new learning paradigm named Cross Pairwise Ranking (CPR).
CPR achieves unbiased recommendation without knowing the exposure mechanism.
We prove theoretically that this approach offsets the influence of user/item propensity on learning.
arXiv Detail & Related papers (2022-04-26T09:20:27Z)
- Does Data Repair Lead to Fair Models? Curating Contextually Fair Data to Reduce Model Bias [10.639605996067534]
Contextual information is a valuable cue for Deep Neural Networks (DNNs) to learn better representations and improve accuracy.
In COCO, many object categories have a much higher co-occurrence with men compared to women, which can bias a DNN's prediction in favor of men.
We introduce a data repair algorithm using the coefficient of variation, which can curate fair and contextually balanced data for a protected class.
arXiv Detail & Related papers (2021-10-20T06:00:03Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate offline observational data, which is often abundant in practice, to improve sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
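As noted in the Counterfactual Attentiveness Test (CAT) entry above, the swap-and-compare idea is straightforward to sketch. The Python below is a hypothetical illustration, not the CAT paper's code: it swaps one input field with the corresponding field from another example and reports the fraction of predictions that change, treating a higher fraction as a sign of attentiveness. The `predict` interface, the swapped field, and the toy keyword model are assumptions.

```python
from typing import Callable, Dict, List

def counterfactual_attentiveness(
    examples: List[Dict[str, str]],
    predict: Callable[[Dict[str, str]], str],
    swap_field: str = "context",
) -> float:
    """Fraction of examples whose prediction changes when one input field
    is replaced with the same field taken from a different example.

    An attentive model should usually change its prediction when a relevant
    part of the input is swapped out (hypothetical metric, for illustration).
    """
    changed = 0
    for i, ex in enumerate(examples):
        donor = examples[(i + 1) % len(examples)]  # counterpart from another example
        counterfactual = {**ex, swap_field: donor[swap_field]}
        if predict(counterfactual) != predict(ex):
            changed += 1
    return changed / len(examples)

# Hypothetical usage with a trivial keyword-based "model".
examples = [
    {"context": "the movie was wonderful", "question": "sentiment?"},
    {"context": "the movie was terrible", "question": "sentiment?"},
]
predict = lambda ex: "pos" if "wonderful" in ex["context"] else "neg"
print(counterfactual_attentiveness(examples, predict))  # 1.0: every prediction flips
```

A real evaluation along these lines would swap the parts of the input the task actually depends on and compare models or prompting setups by this score.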