VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
- URL: http://arxiv.org/abs/2112.02399v3
- Date: Thu, 10 Aug 2023 15:31:54 GMT
- Title: VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
- Authors: Longtian Qiu, Renrui Zhang, Ziyu Guo, Ziyao Zeng, Zilu Guo, Yafeng Li,
Guangnan Zhang
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning.
We propose to enhance CLIP via Visual-guided Texts, named VT-CLIP.
In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
- Score: 2.0434814235659555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention
recently for its transferable visual representation learning. However, due to
the semantic gap within datasets, CLIP's pre-trained image-text alignment
becomes sub-optimal on downstream tasks, which severely harms its transfer
performance. To better adapt the cross-modality embedding space, we propose to
enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide
textual features of different categories to adaptively explore informative
regions on the image and aggregate visual features by attention mechanisms. In
this way, the texts become visual-guided, namely, more semantically correlated
with downstream images, which greatly benefits the category-wise matching
process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known
classification datasets to demonstrate its effectiveness.
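As a rough illustration of the mechanism the abstract describes, the sketch below lets per-category CLIP text features attend over image patch features through a cross-attention layer, then blends the attended output back into the text features before the usual cosine matching. The module name, the blend ratio `alpha`, and the single-layer design are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualGuidedText(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, alpha: float = 0.5):
        super().__init__()
        # Cross-attention: per-category text features query the image patch tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = alpha  # how much attended (visual-guided) signal to mix back in

    def forward(self, text_feats: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (num_classes, dim)          one CLIP text embedding per category
        # patch_feats: (batch, num_patches, dim)   CLIP visual tokens for each image
        b = patch_feats.shape[0]
        queries = text_feats.unsqueeze(0).expand(b, -1, -1)             # (B, C, D)
        attended, _ = self.cross_attn(queries, patch_feats, patch_feats)
        guided = self.alpha * attended + (1 - self.alpha) * queries     # visual-guided texts
        return F.normalize(guided, dim=-1)

def classify(image_feat: torch.Tensor, guided_text: torch.Tensor, logit_scale: float = 100.0):
    # image_feat: (B, D) global CLIP image embedding; cosine matching against each category.
    image_feat = F.normalize(image_feat, dim=-1)
    return logit_scale * torch.einsum("bd,bcd->bc", image_feat, guided_text)
```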
Related papers
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings exhibit desirable geometric properties in embedding space to a larger degree.
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation (a generic EMA self-distillation sketch appears after this list).
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-15T03:26:28Z)
- ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations [4.588028371034406]
We propose ContextCLIP, a contextual and contrastive learning framework for the contextual alignment of image-text pairs.
Our framework improves image-text alignment by contextually aligning text and image representations in the joint embedding space.
ContextCLIP showed good qualitative performance for text-to-image retrieval tasks and enhanced classification accuracy.
arXiv Detail & Related papers (2022-11-14T05:17:51Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or language branch (a minimal adapter sketch appears after this list).
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
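For the SILC entry above: a generic sketch of self-distillation from an exponential moving average (EMA) teacher, the mechanism its summary mentions for local image features. The momentum value and the plain MSE objective are illustrative assumptions, not SILC's exact losses.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.996):
    # Teacher weights track a slow exponential moving average of the student.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)

def local_distill_loss(student_patches: torch.Tensor, teacher_patches: torch.Tensor):
    # student_patches, teacher_patches: (B, num_patches, D) local visual token features.
    s = F.normalize(student_patches, dim=-1)
    t = F.normalize(teacher_patches, dim=-1).detach()
    return F.mse_loss(s, t)

# Rough usage per training step (image_encoder is the student network):
#   teacher = copy.deepcopy(image_encoder).requires_grad_(False)
#   loss = clip_contrastive_loss + local_distill_loss(image_encoder(x), teacher(x))
#   loss.backward(); optimizer.step(); ema_update(teacher, image_encoder)
```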
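For the CLIP-Adapter entry above: a minimal sketch of a feature adapter, i.e. a small bottleneck MLP attached to a frozen CLIP branch whose output is residually blended with the original embedding. The bottleneck width and blend ratio below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        # Small bottleneck MLP trained on top of frozen CLIP features.
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )
        self.ratio = ratio  # residual blend between adapted and original feature

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, D) frozen CLIP image (or text) embedding.
        adapted = self.mlp(feat)
        mixed = self.ratio * adapted + (1.0 - self.ratio) * feat
        return F.normalize(mixed, dim=-1)
```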
This list is automatically generated from the titles and abstracts of the papers in this site.