Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP
- URL: http://arxiv.org/abs/2410.08469v2
- Date: Wed, 16 Oct 2024 14:09:14 GMT
- Title: Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP
- Authors: Eunji Kim, Kyuhong Shim, Simyung Chang, Sungroh Yoon
- Abstract summary: A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images.
We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI).
SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance.
- Score: 46.53595526049201
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of SToRI is demonstrated through comprehensive experiments on few-shot image classification and image retrieval tailored to user preferences.
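To make the reweighting step concrete, here is a minimal, hedged sketch (not the authors' released code; the function name, the weighted mean pooling, and the toy tensors are assumptions) of scaling per-token embeddings by importance weights before pooling them into a single CLIP-style text embedding:

```python
# Hypothetical illustration of semantic token reweighting: scale selected
# token embeddings by importance weights, then pool into one text embedding.
import torch
import torch.nn.functional as F

def reweighted_text_embedding(token_embeds: torch.Tensor,
                              weights: torch.Tensor) -> torch.Tensor:
    """token_embeds: (seq_len, dim) per-token embeddings from a text encoder.
    weights: (seq_len,) non-negative importance weights (1.0 = neutral)."""
    weighted = token_embeds * weights.unsqueeze(-1)   # emphasize / de-emphasize tokens
    pooled = weighted.sum(dim=0) / weights.sum()      # weighted mean pooling
    return F.normalize(pooled, dim=-1)                # unit-norm, as in CLIP

# Toy usage: emphasize the second token ("red" in "a red car", say) twice as much.
embeds = torch.randn(3, 512)                          # placeholder token embeddings
weights = torch.tensor([1.0, 2.0, 1.0])
text_emb = reweighted_text_embedding(embeds, weights)
print(text_emb.shape)                                 # torch.Size([512])
```

In this reading, the weights could come either from data-driven statistics (e.g., few-shot classification) or from explicit user emphasis, which is the kind of control the abstract describes.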
Related papers
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more precise correspondences between text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- Uncovering the Text Embedding in Text-to-Image Diffusion Models [17.108496821429494]
Text embedding, as the pivotal intermediary between text and images, remains relatively underexplored.
We identify two critical insights regarding the importance of per-word embeddings and their contextual correlations within the text embedding.
We find that text embedding inherently possesses diverse semantic potentials, and further reveal this property through the lens of singular value decomposition.
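As an illustration of inspecting per-token text embeddings through singular value decomposition (a hedged sketch; the tensor shapes and the interpretation of singular directions are assumptions, not the cited paper's code):

```python
# Hypothetical example: decompose the per-token embedding matrix of one prompt
# to surface its dominant semantic directions.
import torch

token_embeds = torch.randn(8, 512)                    # placeholder (seq_len, dim) embeddings
U, S, Vh = torch.linalg.svd(token_embeds, full_matrices=False)

# Each right-singular vector in Vh is a candidate "semantic direction"; the
# singular values in S indicate how much of the prompt's embedding mass lies
# along each direction.
explained = S**2 / (S**2).sum()
print(explained)                                      # relative weight of each direction
```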
arXiv Detail & Related papers (2024-04-01T14:59:13Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations [4.588028371034406]
We propose ContextCLIP, a contextual and contrastive learning framework for the contextual alignment of image-text pairs.
Our framework was observed to improve the image-text alignment by aligning text and image representations contextually in the joint embedding space.
ContextCLIP showed good qualitative performance for text-to-image retrieval tasks and enhanced classification accuracy.
arXiv Detail & Related papers (2022-11-14T05:17:51Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
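A minimal sketch of text-to-pixel alignment via per-pixel similarity with a text embedding (assumptions throughout: the function name, feature shapes, and the thresholding step are illustrative, not the CRIS implementation):

```python
# Hypothetical illustration: correlate a text embedding with per-pixel visual
# features to score how well each pixel matches the referring expression.
import torch
import torch.nn.functional as F

def text_to_pixel_logits(pixel_feats: torch.Tensor,
                         text_emb: torch.Tensor) -> torch.Tensor:
    """pixel_feats: (dim, H, W) visual feature map; text_emb: (dim,) text embedding."""
    pixel_feats = F.normalize(pixel_feats, dim=0)
    text_emb = F.normalize(text_emb, dim=0)
    # Cosine similarity per pixel; training would push referred pixels up and
    # non-referred pixels down with a contrastive / pixel-wise objective.
    return torch.einsum("dhw,d->hw", pixel_feats, text_emb)

logits = text_to_pixel_logits(torch.randn(512, 14, 14), torch.randn(512))
mask = logits.sigmoid() > 0.5                         # crude binary mask from similarities
print(mask.shape)                                     # torch.Size([14, 14])
```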
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.