CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation
- URL: http://arxiv.org/abs/2507.14312v1
- Date: Fri, 18 Jul 2025 18:32:17 GMT
- Title: CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation
- Authors: Marc Lafon, Gustavo Adolfo Vargas Hakim, Clément Rambour, Christian Desrosiers, Nicolas Thome
- Abstract summary: Vision-language models (VLMs) like CLIP exhibit strong zero-shot capabilities but often fail to generalize under distribution shifts. Test-time adaptation (TTA) allows models to update at inference time without labeled data, typically via entropy minimization. We propose CLIPTTA, a new gradient-based TTA method for vision-language models that leverages a soft contrastive loss aligned with CLIP's pre-training objective.
- Score: 15.746085775084234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) like CLIP exhibit strong zero-shot capabilities but often fail to generalize under distribution shifts. Test-time adaptation (TTA) allows models to update at inference time without labeled data, typically via entropy minimization. However, this objective is fundamentally misaligned with the contrastive image-text training of VLMs, limiting adaptation performance and introducing failure modes such as pseudo-label drift and class collapse. We propose CLIPTTA, a new gradient-based TTA method for vision-language models that leverages a soft contrastive loss aligned with CLIP's pre-training objective. We provide a theoretical analysis of CLIPTTA's gradients, showing how its batch-aware design mitigates the risk of collapse. We further extend CLIPTTA to the open-set setting, where both in-distribution (ID) and out-of-distribution (OOD) samples are encountered, using an Outlier Contrastive Exposure (OCE) loss to improve OOD detection. Evaluated on 75 datasets spanning diverse distribution shifts, CLIPTTA consistently outperforms entropy-based objectives and is highly competitive with state-of-the-art TTA methods, outperforming them on a large number of datasets and exhibiting more stable performance across diverse shifts.
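To make the idea above concrete, the following is a minimal PyTorch sketch of a soft, batch-aware contrastive test-time objective in the spirit described by the abstract. It is an illustration only, not the authors' CLIPTTA implementation: the exact loss, the construction of the soft pseudo-labels, the set of adapted parameters, and the open-set OCE term are all defined in the paper, while the function name, the temperature `tau`, and the choice of detached zero-shot predictions as soft targets below are assumptions made for this sketch.

```python
import torch

def soft_contrastive_tta_loss(image_feats: torch.Tensor,
                              text_feats: torch.Tensor,
                              tau: float = 0.01) -> torch.Tensor:
    # image_feats: (B, D) L2-normalized embeddings of the test batch
    # text_feats:  (C, D) L2-normalized embeddings of the class prompts
    sims = image_feats @ text_feats.t() / tau          # (B, C) similarity logits

    # Soft pseudo-labels from the current predictions (no gradient).
    with torch.no_grad():
        targets = sims.softmax(dim=-1)                 # (B, C)

    # Image-to-text direction: normalize over classes (per-sample prediction).
    log_p_i2t = sims.log_softmax(dim=-1)               # (B, C)
    # Text-to-image direction: normalize over the batch. This column-wise term
    # is what makes the objective batch-aware: if every image drifts toward the
    # same class, the per-class softmax over images pushes back, which is the
    # intuition behind mitigating class collapse.
    log_p_t2i = sims.log_softmax(dim=0)                # (B, C)

    return -0.5 * (targets * (log_p_i2t + log_p_t2i)).sum(dim=-1).mean()
```

In a typical TTA loop, a loss like this would be minimized over a small set of parameters (for example, prompt tokens or normalization layers) for each incoming test batch; which parameters CLIPTTA actually updates is specified in the paper.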
Related papers
- Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation [18.82879703518279]
Existing training-free test-time adaptation methods operate strictly within CLIP's original feature space. We propose MS-TTA, a training-free approach that enhances feature representations beyond CLIP's space using a single-step k-nearest neighbors (kNN) Mean-Shift.
arXiv Detail & Related papers (2025-07-01T06:22:00Z) - FRET: Feature Redundancy Elimination for Test Time Adaptation [4.793572485305334]
Test-Time Adaptation (TTA) aims to enhance the generalization of deep learning models when faced with test data that exhibits distribution shifts from the training data. In practice, we observe that feature redundancy in embeddings tends to increase as domain shifts intensify in TTA. We introduce Feature Redundancy Elimination for Test-time Adaptation (FRET), a novel perspective for TTA.
arXiv Detail & Related papers (2025-05-15T18:22:56Z) - ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning [27.467732819969935]
This paper attempts to explore the impact of large-scale visual foundation models on Long-Tailed Semi-Supervised Learning (LTSSL). We employ three strategies: Linear Probing (LP), Lightweight Fine-Tuning (LFT), and Full Fine-Tuning (FFT). Our analysis presents the following insights: i) Compared to LTSSL algorithms trained from scratch, FFT results in a decline in model performance, whereas LP and LFT, although boosting overall model performance, exhibit negligible benefits to tail classes. We propose an Unbiased Lightweight Fine-tuning strategy, ULFine.
arXiv Detail & Related papers (2025-05-08T08:54:57Z) - R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning [97.49610356913874]
We propose a robust test-time prompt tuning (R-TPT) method for vision-language models (VLMs). R-TPT mitigates the impact of adversarial attacks during the inference stage. We introduce a plug-and-play reliability-based weighted ensembling strategy to strengthen the defense.
arXiv Detail & Related papers (2025-04-15T13:49:31Z) - TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models [53.91006249339802]
We propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks.
TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP.
We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets.
arXiv Detail & Related papers (2024-11-20T08:58:59Z) - Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing Images [16.0258685984844]
Continual learning (CL) moves beyond the one-way training paradigm and enables a model to adapt to new data, semantics and tasks continuously.
We propose a unified continual learning model that leverages multi-task joint learning covering pixel-level classification, instance-level segmentation and image-level perception.
arXiv Detail & Related papers (2024-07-19T12:22:32Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the approach is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - Towards Robust Federated Learning via Logits Calibration on Non-IID Data [49.286558007937856]
Federated learning (FL) is a privacy-preserving distributed management framework based on collaborative model training of distributed devices in edge networks.
Recent studies have shown that FL is vulnerable to adversarial examples, leading to a significant drop in its performance.
In this work, we adopt the adversarial training (AT) framework to improve the robustness of FL models against adversarial example (AE) attacks.
arXiv Detail & Related papers (2024-03-05T09:18:29Z) - Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [76.410400238974]
We propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
arXiv Detail & Related papers (2023-05-29T11:03:59Z) - CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP on downstream tasks undesirably degrades OOD performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z) - Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning [94.35586521144117]
We investigate whether applying contrastive learning to fine-tuning would bring further benefits.
We propose Contrast-regularized tuning (Core-tuning), a novel approach for fine-tuning contrastive self-supervised visual models; a minimal sketch of this general idea appears just after this list.
arXiv Detail & Related papers (2021-02-12T16:31:24Z)
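To illustrate the contrast-regularized fine-tuning idea from the last entry, here is a short, hedged sketch that combines cross-entropy with a generic supervised contrastive regularizer over the fine-tuning batch. It is not the Core-tuning method itself: the function name, the weight `lam`, and the temperature `tau` are assumptions, and the actual Core-tuning objective includes further components described in that paper.

```python
import torch
import torch.nn.functional as F

def contrast_regularized_loss(features: torch.Tensor,
                              logits: torch.Tensor,
                              labels: torch.Tensor,
                              tau: float = 0.1,
                              lam: float = 0.1) -> torch.Tensor:
    """features: (B, D) L2-normalized embeddings; logits: (B, C); labels: (B,)."""
    ce = F.cross_entropy(logits, labels)               # standard fine-tuning term

    B = labels.size(0)
    sims = features @ features.t() / tau               # (B, B) pairwise similarities
    eye = torch.eye(B, dtype=torch.bool, device=features.device)

    # Log-softmax over all non-self pairs for each anchor.
    denom = sims.masked_fill(eye, float('-inf')).logsumexp(dim=-1, keepdim=True)
    log_prob = sims - denom

    # Positives = other samples sharing the anchor's label.
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye
    pos_counts = pos.sum(dim=-1)
    valid = pos_counts > 0                              # anchors with at least one positive

    sup_con = -(log_prob * pos.float()).sum(dim=-1)
    sup_con = (sup_con[valid] / pos_counts[valid]).mean() if valid.any() else sims.new_zeros(())

    return ce + lam * sup_con
```

The contrastive term pulls same-class features together and pushes different-class features apart, which is the extra benefit over plain cross-entropy fine-tuning that the entry alludes to.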