Interpreting CLIP's Image Representation via Text-Based Decomposition
- URL: http://arxiv.org/abs/2310.05916v4
- Date: Fri, 29 Mar 2024 03:40:47 GMT
- Title: Interpreting CLIP's Image Representation via Text-Based Decomposition
- Authors: Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt
- Abstract summary: We investigate the CLIP image encoder by analyzing how individual model components affect the final representation.
We decompose the image representation as a sum across individual image patches, model layers, and attention heads.
We use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter.
- Score: 73.54377859089801
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.
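A minimal sketch of the decomposition idea described in the abstract, not the authors' code: it assumes (hypothetically) that per-layer, per-head, per-patch contributions to the CLIP image embedding have already been extracted and projected into the joint image-text space. All tensors below are random stand-ins, the name `head_contribs` is illustrative, and the text-scoring step is a simplified one-shot version of the paper's greedy text-span procedure.
```python
# Sketch under stated assumptions: random tensors stand in for activations
# that would in practice be collected via hooks on a CLIP ViT and text encoder.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
L, H, P, D = 12, 12, 49, 512   # layers, heads, patches (7x7), embedding dim

# Hypothetical per-(layer, head, patch) contributions to the image embedding,
# already projected into the joint image-text space.
head_contribs = torch.randn(L, H, P, D)

# 1) Decomposition: the image representation as a sum over patches, layers, and heads.
image_embedding = head_contribs.sum(dim=(0, 1, 2))

# 2) Interpreting one attention head with text: score a pool of candidate text
#    embeddings (random stand-ins) by how much of the head's output they explain.
text_pool = F.normalize(torch.randn(1000, D), dim=-1)   # candidate text embeddings
head_out = head_contribs[10, 3].reshape(-1, D)           # one head, flattened over patches
scores = (head_out @ text_pool.T).pow(2).sum(dim=0)      # variance explained per text direction
top_texts = scores.topk(5).indices                       # most descriptive texts for this head
print("texts that best span head (10, 3):", top_texts.tolist())

# 3) Zero-shot segmentation heatmap: per-patch contributions (summed over layers
#    and heads) scored against a single class text embedding.
class_text = F.normalize(torch.randn(D), dim=-1)
patch_contribs = head_contribs.sum(dim=(0, 1))           # [P, D]
heatmap = (patch_contribs @ class_text).reshape(7, 7)    # higher = more class-relevant patch
print("segmentation heatmap shape:", tuple(heatmap.shape))
```
With real CLIP activations, step 2 is what reveals property-specific head roles and step 3 yields the spatial localization used for zero-shot segmentation.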
Related papers
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings satisfy geometric properties of the embedding space to a larger degree.
arXiv Detail & Related papers (2024-09-15T13:02:14Z) - Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP [53.18562650350898]
We introduce a general framework which can identify the roles of various components in ViTs beyond CLIP.
We also introduce a novel scoring function to rank components by their importance with respect to specific features.
Applying our framework to various ViT variants, we gain insights into the roles of different components with respect to particular image features.
arXiv Detail & Related papers (2024-06-03T17:58:43Z) - CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [67.43527289422978]
We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs.
We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
arXiv Detail & Related papers (2023-10-02T17:58:52Z) - ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition [20.000253437661]
Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb.
We leverage the CLIP foundation model, which has learned the context of images via language descriptions.
Our cross-attention-based Transformer, ClipSitu XTF, outperforms the existing state-of-the-art by a large margin of 14.1% on semantic role labelling.
arXiv Detail & Related papers (2023-07-02T15:05:15Z) - Parts of Speech-Grounded Subspaces in Vision-Language Models [32.497303059356334]
We propose to separate representations of different visual modalities in CLIP's joint vision-language space.
We learn subspaces capturing variability corresponding to a specific part of speech, while minimising variability to the rest.
We show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances.
arXiv Detail & Related papers (2023-05-23T13:32:19Z) - STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
arXiv Detail & Related papers (2023-01-30T17:21:30Z) - CLIP2GAN: Towards Bridging Text with the Latent Space of GANs [128.47600914674985]
We propose a novel framework, CLIP2GAN, by leveraging the CLIP model and StyleGAN.
The key idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN.
arXiv Detail & Related papers (2022-11-28T04:07:17Z) - Injecting Image Details into CLIP's Feature Space [29.450159407113155]
We introduce an efficient framework that can produce a single feature representation for a high-resolution image.
In the framework, we train a feature fusing model based on CLIP features extracted from a carefully designed image patch method.
We validate our framework by retrieving images with class-prompted queries on real-world and synthetic datasets.
arXiv Detail & Related papers (2022-08-31T06:18:10Z) - Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z)