Exploring Visual Interpretability for Contrastive Language-Image
Pre-training
- URL: http://arxiv.org/abs/2209.07046v1
- Date: Thu, 15 Sep 2022 05:01:03 GMT
- Title: Exploring Visual Interpretability for Contrastive Language-Image
Pre-training
- Authors: Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, Xiaomeng Li
- Abstract summary: Contrastive Language-Image pre-training learns rich representations via readily available supervision from natural language.
The visual interpretability of CLIP has not yet been studied.
We integrate the above methods as Interpretable Contrastive Language-Image pre-training (ICLIP).
- Score: 23.569964756096986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image pre-training (CLIP) learns rich representations
via readily available supervision from natural language. It can improve general
performance on downstream vision tasks, including but not limited to zero-shot
classification, long-tailed recognition, segmentation, retrieval, captioning, and video.
However, to the best of our knowledge, the visual interpretability of CLIP has not yet
been studied. To provide visual explanations of its predictions, we propose the
Image-Text Similarity Map (ITSM). Based on it, we surprisingly find that CLIP prefers
background regions over foregrounds and presents erroneous visualizations that
contradict human understanding. Experimentally, we find the devil is in the pooling
part: inappropriate pooling methods lead to a phenomenon called semantic shift. To
correct and boost the visualization results, we propose Masked Max Pooling, which uses
an attention map from a self-supervised image encoder. Meanwhile, the interpretability
task and the recognition task require different representations. To address this
problem, we propose dual projections to cater to both requirements. We integrate the
above methods as Interpretable Contrastive Language-Image pre-training (ICLIP).
Experiments suggest that ICLIP greatly improves interpretability; for example, the
nontrivial improvements are $32.85\%$ and $49.10\%$, respectively, on the VOC 2012
dataset.
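To make the ITSM and Masked Max Pooling ideas concrete, here is a minimal PyTorch-style sketch, assuming CLIP-style patch features. The tensor names (`patch_feats`, `text_feats`, `attn_mask`), shapes, and the 0.5 mask threshold are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of an Image-Text Similarity Map (ITSM) and masked max pooling,
# assuming CLIP-style patch features; names and shapes are illustrative only.
import torch
import torch.nn.functional as F

def image_text_similarity_map(patch_feats, text_feats):
    """patch_feats: (B, N, D) patch embeddings; text_feats: (B, D) text embeddings.
    Returns a (B, N) map of cosine similarities, reshapeable to the patch grid."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return torch.einsum("bnd,bd->bn", patch_feats, text_feats)

def masked_max_pool(patch_feats, attn_mask):
    """attn_mask: (B, N) foreground attention from a self-supervised encoder,
    used here to suppress background patches before max pooling."""
    masked = patch_feats.masked_fill(attn_mask.unsqueeze(-1) < 0.5, float("-inf"))
    pooled, _ = masked.max(dim=1)          # (B, D) pooled image embedding
    return pooled

# Usage with random tensors standing in for real CLIP features.
B, N, D = 2, 196, 512                      # e.g. a 14x14 patch grid from a ViT-B/16
patch_feats = torch.randn(B, N, D)
text_feats = torch.randn(B, D)
attn_mask = (torch.rand(B, N) > 0.5).float()

itsm = image_text_similarity_map(patch_feats, text_feats)   # (B, 196) -> 14x14 heatmap
img_emb = masked_max_pool(patch_feats, attn_mask)            # (B, 512)
print(itsm.shape, img_emb.shape)
```

In a sketch like this, the dual projections proposed above would amount to feeding the backbone features through two separate linear heads: one producing the patch features used for the similarity map, the other producing the features that are pooled for recognition.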
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos [19.08882495584709]
We show that it is possible to enhance CLIP's fine-grained and syntactic abilities without compromising its semantic properties.
We adapt CLIP efficiently on a high-quality, comprehensive, and relatively small dataset.
We learn a powerful visual representation, dubbed Fine-Grained CLIP (FiGCLIP), that preserves semantic understanding while being detail-oriented.
arXiv Detail & Related papers (2024-01-15T13:27:34Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z) - Is a Caption Worth a Thousand Images? A Controlled Study for
Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z) - DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme (a sketch of this dual-encoder contrastive objective appears after this list).
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.