Enhancing Robustness of CLIP to Common Corruptions through Bimodal Test-Time Adaptation
- URL: http://arxiv.org/abs/2412.02837v1
- Date: Tue, 03 Dec 2024 21:02:14 GMT
- Title: Enhancing Robustness of CLIP to Common Corruptions through Bimodal Test-Time Adaptation
- Authors: Sarthak Kumar Maharana, Baoming Zhang, Leonid Karlinsky, Rogerio Feris, Yunhui Guo
- Abstract summary: We show that zero-shot CLIP lacks robustness to common image corruptions at increasing severity levels at test time. We propose a bimodal TTA method specially designed to improve CLIP's robustness to common image corruptions. We obtain mean accuracy improvements of 9.7%, 5.94%, and 5.12% on CIFAR-10C, CIFAR-100C, and ImageNet-C, respectively.
- Score: 18.278043899825267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although open-vocabulary classification models like Contrastive Language-Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at increasing severity levels at test time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we find that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose \framework, a bimodal TTA method specially designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for better image feature extraction but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in TTA for CLIP, specifically for domains involving image corruption. In particular, with a ViT-B/16 vision backbone, we obtain mean accuracy improvements of 9.7%, 5.94%, and 5.12% for CIFAR-10C, CIFAR-100C, and ImageNet-C, respectively.
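The core mechanism described in the abstract, adapting the visual encoder on unlabeled corrupted batches while pulling pseudo-label class prototypes toward their corresponding text features, can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the authors' released code), assuming the OpenAI `clip` package, hard pseudo-labels, and an added entropy-minimization term with uniform loss weighting; all of these are simplifying assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
import clip  # assumption: OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
model.float()  # keep parameters in fp32 so the optimizer behaves on GPU

class_names = ["airplane", "automobile", "bird"]  # illustrative subset of CIFAR-10C classes
with torch.no_grad():
    text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feats = F.normalize(model.encode_text(text_tokens), dim=-1)  # frozen text features

# Adapt only the visual encoder at test time; the text encoder stays frozen in this sketch.
optimizer = torch.optim.SGD(model.visual.parameters(), lr=1e-3, momentum=0.9)

def tta_step(images):
    """One adaptation step on an unlabeled batch of preprocessed corrupted images."""
    img_feats = F.normalize(model.encode_image(images.to(device)), dim=-1)
    logits = 100.0 * img_feats @ text_feats.t()       # CLIP-style scaled cosine similarities
    probs = logits.softmax(dim=-1)
    pseudo_labels = probs.argmax(dim=-1)              # hard pseudo-labels

    # (a) Entropy minimization: encourage confident predictions on corrupted inputs.
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

    # (b) Prototype-text alignment: pull each pseudo-label class prototype toward its text feature.
    align = img_feats.new_zeros(())
    for k in range(len(class_names)):
        mask = pseudo_labels == k
        if mask.any():
            proto = F.normalize(img_feats[mask].mean(dim=0), dim=-1)
            align = align + (1.0 - proto @ text_feats[k])

    loss = entropy + align / len(class_names)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.argmax(dim=-1)                      # predictions for this batch
```

In a full evaluation loop, `tta_step` would be called once per incoming test batch of a given corruption type and the returned predictions scored against held-out labels; the paper additionally strengthens the text side of the image-text alignment, which this sketch omits for brevity.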
Related papers
- Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning [69.33115351856785]
We present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that T2I-PAL boosts recognition performance by 3.47% on average.
arXiv Detail & Related papers (2025-06-12T11:09:49Z) - un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP [75.19266107565109]
Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. This work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible.
arXiv Detail & Related papers (2025-05-30T12:29:38Z) - Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation [4.063715077687089]
Distill CLIP (DCLIP) is a fine-tuned variant of the CLIP model. It enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities.
arXiv Detail & Related papers (2025-05-25T07:08:07Z) - Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP [22.33658954569737]
We build a mutual guidance mechanism that introduces an Image-Guided-Text (IGT) component and a Text-Guided-Image (TGI) component.
Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method.
We propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% at roughly 100x lower time cost.
arXiv Detail & Related papers (2024-12-16T02:03:45Z) - TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models [53.91006249339802]
We propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks.
TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP.
We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets.
arXiv Detail & Related papers (2024-11-20T08:58:59Z) - TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment [0.7499722271664144]
Contrastive Language-Image Pre-training (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
arXiv Detail & Related papers (2024-09-03T14:33:01Z) - Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - Understanding the Vulnerability of CLIP to Image Compression [26.536819387473482]
We show that CLIP is vulnerable to changes in image quality under compression.
We evaluate this vulnerability extensively on CIFAR-10 and STL-10.
arXiv Detail & Related papers (2023-11-23T14:33:53Z) - Linear Alignment of Vision-language Models for Image Captioning [8.921774238325566]
We propose a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods.
We also propose two new learning-based image-captioning metrics built on CLIP score along with our proposed alignment.
arXiv Detail & Related papers (2023-07-10T17:59:21Z) - S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions [69.01985134519244]
Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains.
We propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images.
S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark.
arXiv Detail & Related papers (2023-05-23T14:18:11Z) - Expressive Losses for Verified Robustness via Convex Combinations [67.54357965665676]
We study the relationship between the over-approximation coefficient and performance profiles across different expressive losses.
We show that, while expressivity is essential, better approximations of the worst-case loss are not necessarily linked to superior robustness-accuracy trade-offs.
arXiv Detail & Related papers (2023-05-23T12:20:29Z) - Context-Aware Robust Fine-Tuning [23.027441849817922]
Contrastive Language-Image Pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to "[CLASS]".
Fine-tuning CLIP models improves accuracy but sacrifices robustness on downstream tasks.
We propose Context-Aware Robust Fine-tuning (CAR-FT) to solve this problem.
arXiv Detail & Related papers (2022-11-29T13:07:41Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z) - CyCLIP: Cyclic Contrastive Language-Image Pretraining [34.588147979731374]
Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness.
We demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions.
We propose CyCLIP, a framework for contrastive representation learning that explicitly optimizes the learned representations to be geometrically consistent in the image and text spaces.
arXiv Detail & Related papers (2022-05-28T15:31:17Z) - Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on web-scale image-text pairs, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z) - ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension [114.85628613911713]
Large-scale pre-trained models are useful for image classification across domains.
We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC.
arXiv Detail & Related papers (2022-04-12T17:55:38Z) - No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z) - Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image-captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.