Understanding the Vulnerability of CLIP to Image Compression
- URL: http://arxiv.org/abs/2311.14029v1
- Date: Thu, 23 Nov 2023 14:33:53 GMT
- Title: Understanding the Vulnerability of CLIP to Image Compression
- Authors: Cangxiong Chen, Vinay P. Namboodiri, Julian Padget
- Abstract summary: We show that CLIP is vulnerable to changes in image quality under compression.
We evaluate this vulnerability extensively on CIFAR-10 and STL-10.
- Score: 26.536819387473482
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: CLIP is a widely used foundational vision-language model, employed for
zero-shot image recognition and other image-text alignment tasks. We
demonstrate that CLIP is vulnerable to changes in image quality under
compression. This surprising result is further analysed using an attribution
method, Integrated Gradients. Using this attribution method, we are able to
understand, both quantitatively and qualitatively, exactly how compression
affects the zero-shot recognition accuracy of this model. We evaluate this
extensively on CIFAR-10 and STL-10. Our work provides a basis for
understanding this vulnerability of CLIP and can help in developing more
effective methods to improve the robustness of CLIP and other vision-language
models.
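As a rough illustration of the kind of pipeline the abstract describes, the snippet below is a minimal sketch (not the authors' code): it probes CLIP's zero-shot CIFAR-10 accuracy as JPEG quality drops and then attributes one compressed prediction with Integrated Gradients. It assumes the openai/clip-vit-base-patch32 checkpoint from HuggingFace transformers, torchvision's CIFAR-10, and captum; the prompt template, quality grid, and 500-image subset are illustrative choices, not the paper's protocol.

```python
# Minimal sketch (not the authors' code): measure CLIP zero-shot accuracy on
# CIFAR-10 under increasing JPEG compression, then attribute one compressed
# prediction with Integrated Gradients.
import io

import torch
from PIL import Image
from torchvision.datasets import CIFAR10
from transformers import CLIPModel, CLIPProcessor
from captum.attr import IntegratedGradients

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

testset = CIFAR10(root="./data", train=False, download=True)   # PIL images + labels
prompts = [f"a photo of a {c}" for c in testset.classes]        # illustrative prompt template
text_inputs = processor(text=prompts, return_tensors="pt", padding=True).to(device)

def jpeg(img: Image.Image, quality: int) -> Image.Image:
    """Round-trip a PIL image through JPEG at the given quality setting."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

@torch.no_grad()
def zero_shot_accuracy(quality: int, n_images: int = 500) -> float:
    """Zero-shot accuracy on the first n_images test images at one JPEG quality."""
    correct = 0
    for i in range(n_images):
        img, label = testset[i]
        pixels = processor(images=jpeg(img, quality), return_tensors="pt")["pixel_values"].to(device)
        logits = model(pixel_values=pixels, **text_inputs).logits_per_image  # [1, 10]
        correct += int(logits.argmax(dim=-1).item() == label)
    return correct / n_images

for q in (95, 75, 50, 25, 10):
    print(f"JPEG quality {q:3d}: zero-shot accuracy {zero_shot_accuracy(q):.3f}")

# Integrated Gradients in pixel space for one heavily compressed image: which
# pixels drive the logit of the ground-truth prompt?
def image_to_logits(pixel_values: torch.Tensor) -> torch.Tensor:
    return model(pixel_values=pixel_values, **text_inputs).logits_per_image

img, label = testset[0]
pixels = processor(images=jpeg(img, 10), return_tensors="pt")["pixel_values"].to(device)
ig = IntegratedGradients(image_to_logits)
attributions = ig.attribute(pixels, baselines=torch.zeros_like(pixels), target=label)
print("attribution map shape:", tuple(attributions.shape))
```

Comparing the resulting accuracy curve across quality levels, and the attribution maps for original versus compressed copies of the same image, mirrors the quantitative and qualitative analyses the abstract refers to.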
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
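The summary above does not spell out the training objective; as a hedged illustration only, the sketch below shows one way a CLIP-style contrastive loss can be extended so that each image-text pair also competes against a hard negative caption and a hard negative image. The embeddings are random stand-ins for encoder outputs, and the function name and batching are hypothetical rather than taken from TripletCLIP.

```python
# Hedged illustration, not the TripletCLIP implementation: a symmetric
# contrastive loss whose candidate sets are enlarged with hard negatives.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(img, txt, neg_img, neg_txt, temperature=0.07):
    """img/txt: [B, D] positive embeddings; neg_img/neg_txt: [B, D] hard negatives."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    neg_img, neg_txt = F.normalize(neg_img, dim=-1), F.normalize(neg_txt, dim=-1)

    # Image-to-text: each image scores every caption in the batch plus the hard
    # negative captions; the matching caption (column i) is the target.
    logits_i2t = img @ torch.cat([txt, neg_txt], dim=0).T / temperature   # [B, 2B]
    # Text-to-image: symmetric, with hard negative images appended.
    logits_t2i = txt @ torch.cat([img, neg_img], dim=0).T / temperature   # [B, 2B]

    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets))

B, D = 8, 512
loss = contrastive_loss_with_hard_negatives(torch.randn(B, D), torch.randn(B, D),
                                            torch.randn(B, D), torch.randn(B, D))
print(float(loss))
```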
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- ExIQA: Explainable Image Quality Assessment Using Distortion Attributes [0.3683202928838613]
We propose an explainable approach for distortion identification based on attribute learning.
We generate a dataset consisting of 100,000 images for efficient training.
Our approach achieves state-of-the-art (SOTA) performance across multiple datasets in both PLCC and SRCC metrics.
arXiv Detail & Related papers (2024-09-10T20:28:14Z)
- Unveiling Glitches: A Deep Dive into Image Encoding Bugs within CLIP [0.0]
We focus on CLIP, a model renowned for its integration of vision and language processing.
Our objective is to uncover recurring problems and blind spots in CLIP's image comprehension.
We reveal significant discrepancies in CLIP's interpretation of images compared to human perception.
arXiv Detail & Related papers (2024-06-30T05:23:11Z)
- SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference [11.453253140479166]
We enhance contrastive language-image pretraining's potential for semantic segmentation.
By rethinking self-attention, we find that CLIP can adapt to dense prediction tasks.
We replace the traditional self-attention block in the last layer of the CLIP vision encoder with our CSA module.
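The CSA module itself is not defined in this summary. The block below is a heavily hedged stand-in (an assumption, not confirmed here): a correlative-style attention layer that scores tokens by query-query and key-key affinities, the kind of drop-in replacement for the last attention block that can make per-token features more suitable for dense prediction.

```python
# Hedged stand-in, not necessarily the paper's CSA module: attention computed
# from query-query and key-key correlations instead of query-key.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelativeStyleAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [B, N, D] token features
        q, k = self.q(x), self.k(x)
        # Affinities within each projection space, so a token attends to tokens
        # that resemble itself rather than to a separate key space.
        attn = (F.softmax(q @ q.transpose(-2, -1) * self.scale, dim=-1)
                + F.softmax(k @ k.transpose(-2, -1) * self.scale, dim=-1))
        return attn @ x

x = torch.randn(2, 50, 768)   # e.g. ViT patch tokens (illustrative sizes)
print(CorrelativeStyleAttention(768)(x).shape)
```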
arXiv Detail & Related papers (2023-12-04T03:18:46Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
- Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP [57.53087077735303]
We introduce SDS-CLIP, a lightweight and sample-efficient distillation method to enhance CLIP's compositional visio-linguistic reasoning.
Our approach fine-tunes CLIP using a distillation objective borrowed from large text-to-image generative models like Stable-Diffusion.
On the challenging Winoground benchmark, SDS-CLIP improves the visio-linguistic performance of various CLIP models by up to 7%, while on the ARO dataset, it boosts performance by up to 3%.
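The entry above only names the idea of borrowing a distillation objective from a text-to-image generator. The sketch below is a hedged, self-contained illustration of that structure, in which a frozen stand-in denoiser (TinyDenoiser, a placeholder for a real diffusion U-Net) is conditioned on projected CLIP image embeddings and the denoising error is treated as an extra loss term. The schedule, dimensions, and module names are assumptions for illustration, not the SDS-CLIP implementation.

```python
# Hedged sketch: a distillation-style regulariser in which a frozen stand-in
# denoiser is conditioned on projected CLIP image embeddings. In training this
# term would be added, with a weight, to CLIP's usual contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Placeholder for a frozen generative denoiser conditioned on an embedding."""
    def __init__(self, latent_dim: int = 64, cond_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + cond_dim + 1, 256),
                                 nn.SiLU(), nn.Linear(256, latent_dim))

    def forward(self, noisy_latent, t, cond):
        return self.net(torch.cat([noisy_latent, cond, t], dim=-1))

def distillation_loss(image_emb, latents, denoiser, proj):
    """Predict the noise added to `latents`, conditioned on projected CLIP embeddings."""
    noise = torch.randn_like(latents)
    t = torch.rand(latents.size(0), 1)                 # continuous timestep (illustrative)
    noisy = (1 - t) * latents + t * noise              # simple interpolation schedule (illustrative)
    pred = denoiser(noisy, t, proj(image_emb))
    return F.mse_loss(pred, noise)

B, D, L, C = 8, 512, 64, 128
proj = nn.Linear(D, C)                                 # learned map from CLIP space to conditioning space
denoiser = TinyDenoiser(L, C).requires_grad_(False)    # frozen "teacher"
image_emb, latents = torch.randn(B, D), torch.randn(B, L)
print(float(distillation_loss(image_emb, latents, denoiser, proj)))
```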
arXiv Detail & Related papers (2023-07-18T13:10:11Z)
- Context-Aware Robust Fine-Tuning [23.027441849817922]
Contrastive Language-Image Pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to "[CLASS]".
Fine-tuning CLIP models improves accuracy but sacrifices robustness on downstream tasks.
We propose Context-Aware Robust Fine-tuning (CAR-FT) to solve this problem.
arXiv Detail & Related papers (2022-11-29T13:07:41Z)
- ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension [114.85628613911713]
Large-scale pre-trained models are useful for image classification across domains.
We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC.
arXiv Detail & Related papers (2022-04-12T17:55:38Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)