Calibrating Multi-modal Representations: A Pursuit of Group Robustness
without Annotations
- URL: http://arxiv.org/abs/2403.07241v1
- Date: Tue, 12 Mar 2024 01:47:17 GMT
- Title: Calibrating Multi-modal Representations: A Pursuit of Group Robustness
without Annotations
- Authors: Chenyu You, Yifei Min, Weicheng Dai, Jasjeet S. Sekhon, Lawrence
Staib, James S. Duncan
- Abstract summary: Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks.
These tuned models tend to become highly specialized, limiting their practicality for real-world deployment.
We propose a lightweight representation calibration method for fine-tuning CLIP.
- Score: 20.981354848227912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning pre-trained vision-language models, like CLIP, has yielded
success on diverse downstream tasks. However, several pain points persist for
this paradigm: (i) directly tuning entire pre-trained models becomes both
time-intensive and computationally costly. Additionally, these tuned models
tend to become highly specialized, limiting their practicality for real-world
deployment; (ii) recent studies indicate that pre-trained vision-language
classifiers may overly depend on spurious features -- patterns that correlate
with the target in training data, but are not related to the true labeling
function; and (iii) existing studies on mitigating reliance on spurious
features largely rest on the assumption that such features can be identified,
and thus offer no definitive assurance for real-world applications. As a
pilot study, this work explores mitigating CLIP's reliance on spurious
features without using any group annotations. To this end, we
systematically study the existence of spurious correlations in CLIP and
CLIP+ERM. Following recent work on Deep Feature Reweighting (DFR), we first
verify that last-layer retraining can greatly improve group robustness of
pretrained CLIP. In view of these findings, we propose a lightweight representation
calibration method for fine-tuning CLIP, by first generating a calibration set
using the pretrained CLIP, and then calibrating representations of samples
within this set through contrastive learning, all without the need for group
labels. Extensive experiments and in-depth visualizations on several benchmarks
validate the effectiveness of our proposal, substantially reducing reliance on
spurious features and significantly improving model generalization.
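The abstract's two-step recipe (pseudo-label a calibration set with the pretrained model, then calibrate representations contrastively, all without group labels) can be sketched roughly as follows. The function names, the use of zero-shot predictions as pseudo-labels, and the plain supervised-contrastive loss are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def pseudo_label(image_emb, text_emb):
    """Assign each image the class of its nearest text embedding
    (a stand-in for CLIP zero-shot prediction; no group labels used)."""
    sims = image_emb @ text_emb.T                  # cosine sims (inputs L2-normalized)
    return sims.argmax(axis=1)

def contrastive_calibration_loss(emb, labels, tau=0.1):
    """Supervised contrastive loss over the calibration set: pull together
    samples sharing a pseudo-label, push apart the rest."""
    n = emb.shape[0]
    sims = emb @ emb.T / tau
    np.fill_diagonal(sims, -np.inf)                # exclude self-pairs
    log_prob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    same = (labels[:, None] == labels[None, :]) & ~np.eye(n, dtype=bool)
    per_sample = -np.where(same, log_prob, 0.0).sum(axis=1) / np.maximum(same.sum(axis=1), 1)
    return per_sample.mean()

rng = np.random.default_rng(0)
def unit(x): return x / np.linalg.norm(x, axis=-1, keepdims=True)
text_emb = unit(rng.normal(size=(3, 8)))           # 3 hypothetical class prompts
image_emb = unit(text_emb[[0, 0, 1, 1, 2, 2]] + 0.1 * rng.normal(size=(6, 8)))
labels = pseudo_label(image_emb, text_emb)
loss = contrastive_calibration_loss(image_emb, labels)
```

In this sketch the loss would be minimized over a small calibration head while the CLIP backbone stays frozen, consistent with the "lightweight" framing.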
Related papers
- Robust Calibration of Large Vision-Language Adapters [17.583536041845402]
This paper addresses the critical issue of miscalibration in CLIP-based model adaptation.
We empirically demonstrate that popular CLIP adaptation approaches, such as Adapters, Prompt Learning, and Test-Time Adaptation, substantially degrade the calibration capabilities of the zero-shot baseline.
Motivated by these observations, we present a simple and model-agnostic solution to mitigate miscalibration, by scaling the logit range of each sample to its zero-shot prediction logits.
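Taken at face value, the described fix rescales each sample's adapted logits to the range of its zero-shot logits. A minimal sketch, where the function name and the min-max reading of "logit range" are assumptions:

```python
import numpy as np

def scale_to_zeroshot_range(adapted_logits, zeroshot_logits):
    """Per sample, rescale adapted logits so their min-max range matches
    that of the zero-shot logits (one assumed reading of 'logit range').
    The argmax, hence the prediction, is preserved."""
    a_min = adapted_logits.min(axis=1, keepdims=True)
    a_rng = adapted_logits.max(axis=1, keepdims=True) - a_min
    z_min = zeroshot_logits.min(axis=1, keepdims=True)
    z_rng = zeroshot_logits.max(axis=1, keepdims=True) - z_min
    return z_min + (adapted_logits - a_min) / np.maximum(a_rng, 1e-12) * z_rng

adapted = np.array([[0.0, 10.0, 5.0]])     # over-confident adapted logits
zeroshot = np.array([[0.2, 1.0, 0.6]])     # milder zero-shot logits
scaled = scale_to_zeroshot_range(adapted, zeroshot)
```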
arXiv Detail & Related papers (2024-07-18T15:27:56Z) - BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models [20.88680592729709]
We propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models.
BaFTA directly estimates class centroids using online clustering within a projected embedding space.
We demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.
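The backprop-free centroid estimation could look like a running-mean online clustering step; initializing centroids from class text embeddings and using dot-product assignment are assumptions here, not BaFTA's actual details.

```python
import numpy as np

def online_centroid_update(centroids, counts, emb):
    """Assign an incoming embedding to its nearest centroid and update
    that centroid with a running mean -- no backpropagation involved."""
    k = np.argmax(centroids @ emb)         # nearest class by dot-product similarity
    counts[k] += 1
    centroids[k] += (emb - centroids[k]) / counts[k]
    return k

centroids = np.eye(2, 4)                   # init: 2 classes in a 4-dim embedding space
counts = np.ones(2)
pred = online_centroid_update(centroids, counts, np.array([0.9, 0.1, 0.0, 0.0]))
```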
arXiv Detail & Related papers (2024-06-17T08:16:24Z) - AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning [50.78033979438031]
We first introduce a unified formulation to analyze CLIP-based few-shot learning methods from a perspective of logit bias.
Based on analysis of key components, this paper proposes a novel AMU-Tuning method to learn effective logit bias for CLIP-based few-shot classification.
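The "logit bias" view can be illustrated as a correction term added to zero-shot logits; the per-class bias vector and the scaling parameter below are purely illustrative, not AMU-Tuning's learned components.

```python
import numpy as np

def biased_logits(zeroshot_logits, logit_bias, alpha=1.0):
    """Unified view of CLIP few-shot methods: final logits equal the
    zero-shot logits plus a bias term (here a fixed per-class vector
    scaled by alpha, purely illustrative)."""
    return zeroshot_logits + alpha * logit_bias

zs = np.array([[2.0, 1.9, 0.1]])           # zero-shot nearly ties classes 0 and 1
bias = np.array([0.0, 0.4, 0.0])           # hypothetical few-shot-derived bias
out = biased_logits(zs, bias)
```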
arXiv Detail & Related papers (2024-04-13T10:46:11Z) - Bayesian Exploration of Pre-trained Models for Low-shot Image Classification [14.211305168954594]
This work proposes a simple and effective probabilistic model ensemble framework based on Gaussian processes.
We achieve the integration of prior knowledge by specifying the mean function with CLIP and the kernel function.
We demonstrate that our method consistently outperforms competitive ensemble baselines regarding predictive performance.
arXiv Detail & Related papers (2024-03-30T10:25:28Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training
Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
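Instance-level reweighting of this kind is often realized by exponential tilting of per-sample losses, which is the closed-form inner solution of KL-regularized DRO; treating that as the paper's exact scheme would be an assumption.

```python
import numpy as np

def instance_weights(losses, temperature=1.0):
    """Exponential tilting: up-weight hard (high-loss) samples, as in the
    inner maximization of KL-regularized instance-level DRO (illustrative)."""
    w = np.exp(losses / temperature)
    return w / w.sum()                     # normalized sampling/weighting distribution

losses = np.array([0.1, 0.1, 2.0])         # one hard sample among easy ones
w = instance_weights(losses)
```

Lowering `temperature` concentrates the weight further on the hardest samples; raising it recovers near-uniform weighting.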
arXiv Detail & Related papers (2024-02-22T04:10:57Z) - Robust Fine-Tuning of Vision-Language Models for Domain Generalization [6.7181844004432385]
Foundation models have impressive zero-shot inference capabilities and robustness under distribution shifts.
We present a new recipe for few-shot fine-tuning of the popular vision-language foundation model CLIP.
Our experimentation demonstrates that, while zero-shot CLIP fails to match performance of trained vision models on more complex benchmarks, few-shot CLIP fine-tuning outperforms its vision-only counterparts.
arXiv Detail & Related papers (2023-11-03T20:50:40Z) - RanPAC: Random Projections and Pre-trained Models for Continual Learning [59.07316955610658]
Continual learning (CL) aims to learn different tasks (such as classification) in a non-stationary data stream without forgetting old ones.
We propose a concise and effective approach for CL with pre-trained models.
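A frozen random projection over pretrained features, as the title suggests, can be sketched as below; the output dimension, the ReLU nonlinearity, and the seeding scheme are assumptions for illustration.

```python
import numpy as np

def random_project(features, out_dim=64, seed=0):
    """Frozen random projection + nonlinearity: expands pretrained features
    into a higher-dimensional space; the projection is fixed (seeded), so
    repeated calls are deterministic and nothing is trained or forgotten."""
    r = np.random.default_rng(seed)
    W = r.normal(size=(features.shape[1], out_dim))
    return np.maximum(features @ W, 0.0)   # ReLU

rng = np.random.default_rng(1)
feats = rng.normal(size=(5, 16))           # stand-in pretrained features
proj = random_project(feats)
```

A simple per-task linear head on top of `proj` would then suffice for classification, keeping the continual-learning machinery minimal.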
arXiv Detail & Related papers (2023-07-05T12:49:02Z) - Continual Learners are Incremental Model Generalizers [70.34479702177988]
This paper extensively studies the impact of Continual Learning (CL) models as pre-trainers.
We find that the transfer quality of the representation often increases gradually without noticeable degradation in fine-tuning performance.
We propose a new fine-tuning scheme, GLobal Attention Discretization (GLAD), that preserves rich task-generic representation during solving downstream tasks.
arXiv Detail & Related papers (2023-06-21T05:26:28Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z) - Evaluating CLIP: Towards Characterization of Broader Capabilities and
Downstream Implications [8.15254368157658]
We analyze CLIP and highlight some of the challenges such models pose.
We find that CLIP can inherit biases found in prior computer vision systems.
These results add evidence to the growing body of work calling for a change in the notion of a 'better' model.
arXiv Detail & Related papers (2021-08-05T19:05:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.