Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
- URL: http://arxiv.org/abs/2406.01583v1
- Date: Mon, 3 Jun 2024 17:58:43 GMT
- Title: Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
- Authors: Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi
- Abstract summary: We introduce a framework which can identify the roles of various components in arbitrary vision transformers (ViTs).
Specifically, we automate the decomposition of the final representation into contributions from different model components.
We also introduce a novel scoring function to rank components by their importance with respect to specific features.
- Score: 53.18562650350898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works have explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features like shape, color or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework which can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features. Applying our framework to various ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of different components concerning particular image features. These insights facilitate applications such as image retrieval using text descriptions or reference images, visualizing token importance heatmaps, and mitigating spurious correlations.
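The abstract describes three steps: decomposing the final representation into additive per-component contributions, linearly mapping those contributions into CLIP space, and scoring components against text features. The snippet below is a minimal sketch of that general idea, not the authors' implementation: the per-component contributions are random stand-ins (in practice they would be collected with forward hooks on each attention head and MLP output), the linear map and the text embedding are untrained placeholders rather than a fitted projection and a real CLIP text encoder output, and all names and shapes are hypothetical.

```python
# Minimal sketch (not the authors' code): decompose a ViT's final representation
# into additive per-component contributions, map them into a CLIP-like text space
# with a linear projection, and rank components by a text-based importance score.
import torch

torch.manual_seed(0)

# Hypothetical sizes: number of ViT components (attention heads + MLP blocks),
# ViT feature dimension, CLIP embedding dimension, and a small image batch.
num_components = 24
d_vit, d_clip = 768, 512
num_images = 16

# Per-component contributions c_i such that the final representation is
# (approximately) their sum: z = sum_i c_i. Random placeholders here.
contribs = torch.randn(num_images, num_components, d_vit)
z = contribs.sum(dim=1)  # reconstructed final representation

# (b) Linear map from ViT space to CLIP space. In the paper's setting this would
# be trained to align mapped ViT representations with CLIP image embeddings of
# the same images; here it is simply a random linear layer.
W = torch.nn.Linear(d_vit, d_clip, bias=False)

# Because the map is linear, mapping the sum equals summing the mapped parts,
# so every component gets its own CLIP-space contribution.
clip_contribs = W(contribs)  # (images, components, d_clip)

# (c) Score components against a text feature. A real pipeline would use a CLIP
# text embedding, e.g. for "a photo with fur texture"; here a random unit vector.
text_feat = torch.nn.functional.normalize(torch.randn(d_clip), dim=0)

# Importance score: cosine similarity between each component's CLIP-space
# contribution and the text feature, averaged over images.
sims = torch.nn.functional.cosine_similarity(
    clip_contribs, text_feat.expand_as(clip_contribs), dim=-1)  # (images, components)
scores = sims.mean(dim=0)  # (components,)

top = torch.topk(scores, k=5)
print("components most aligned with the text feature:", top.indices.tolist())
```

A sketch like this also suggests how the listed applications follow: ranking images by the similarity of selected component contributions to a text query gives text-based retrieval, and ablating the top-ranked components for a spurious attribute is one way to mitigate spurious correlations.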
Related papers
- Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning [41.81009725976217]
We provide semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework.
We demonstrate notable improvements over ViTs in learned representation quality across text-to-image and image-to-text retrieval tasks.
arXiv Detail & Related papers (2024-05-26T01:46:22Z)
- Improving Referring Image Segmentation using Vision-Aware Text Features [26.768147543628096]
We present VATEX to improve referring image segmentation by enhancing object and context understanding with Vision-Aware Text Feature.
Our method achieves significant performance improvements on three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref.
arXiv Detail & Related papers (2024-04-12T16:38:48Z)
- Interpreting CLIP's Image Representation via Text-Based Decomposition [73.54377859089801]
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation.
We decompose the image representation as a sum across individual image patches, model layers, and attention heads.
We use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter.
arXiv Detail & Related papers (2023-10-09T17:59:04Z)
- Parts of Speech-Grounded Subspaces in Vision-Language Models [32.497303059356334]
We propose to separate representations of different visual modalities in CLIP's joint vision-language space.
We learn subspaces that capture the variability corresponding to a specific part of speech, while minimising variability with respect to the rest.
We show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances.
arXiv Detail & Related papers (2023-05-23T13:32:19Z)
- Splicing ViT Features for Semantic Appearance Transfer [10.295754142142686]
We present a method for semantically transferring the visual appearance of one natural image to another.
Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image.
arXiv Detail & Related papers (2022-01-02T22:00:34Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task which aims to generate captions with respect to relational information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z)
- Learning Physical Graph Representations from Visual Scenes [56.7938395379406]
Physical Scene Graphs (PSGs) represent scenes as hierarchical graphs with nodes corresponding intuitively to object parts at different scales, and edges to physical connections between parts.
PSGNet augments standard CNNs by including: recurrent feedback connections to combine low- and high-level image information; and graph pooling and vectorization operations that convert spatially uniform feature maps into object-centric graph structures.
We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks.
arXiv Detail & Related papers (2020-06-22T16:10:26Z)