Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation
- URL: http://arxiv.org/abs/2408.00744v1
- Date: Thu, 1 Aug 2024 17:48:08 GMT
- Title: Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation
- Authors: Siyu Jiao, Hongguang Zhu, Jiannan Huang, Yao Zhao, Yunchao Wei, Humphrey Shi
- Abstract summary: This paper introduces a collaborative vision-text optimizing mechanism within the Open-Vocabulary Segmentation (OVS) field.
To the best of our knowledge, we are the first to establish the collaborative vision-text optimizing mechanism within the OVS field.
In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU on A-847, A-150, PC-459, PC-59 and PAS-20, respectively.
- Score: 82.95830628372845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained vision-language models, e.g., CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions either freeze CLIP during training to unilaterally maintain its zero-shot capability, or fine-tune the CLIP vision encoder to achieve perceptual sensitivity to local regions; few of them incorporate vision-text collaborative optimization. Motivated by this, we propose the Content-Dependent Transfer, which adaptively enhances each text embedding by interacting with the input image, presenting a parameter-efficient way to optimize the text representation. In addition, we introduce a Representation Compensation strategy that reviews the original CLIP-V representation as compensation to maintain the zero-shot capability of CLIP. In this way, the vision and text representations of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish a collaborative vision-text optimizing mechanism within the OVS field. Extensive experiments demonstrate that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, it outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU on A-847, A-150, PC-459, PC-59 and PAS-20, respectively. Furthermore, in the panoptic setting on ADE20K, we achieve 27.1 PQ, 73.5 SQ, and 32.9 RQ. Code will be available at https://github.com/jiaosiyu1999/MAFT-Plus.git.
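The abstract names two mechanisms but gives no implementation details. Below is a minimal PyTorch sketch of how they could be wired, purely as a reading of the abstract: the cross-attention design, the dimensions, and the blending coefficient are assumptions, not the authors' MAFT-Plus code.

```python
import torch
import torch.nn as nn

class ContentDependentTransfer(nn.Module):
    """Sketch: enhance each text (class) embedding by letting it attend to
    the input image's visual tokens, making the text side image-conditioned."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, img_tokens: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, K, D) class embeddings; img_tokens: (B, N, D) CLIP-V tokens
        delta, _ = self.attn(query=text_emb, key=img_tokens, value=img_tokens)
        return self.norm(text_emb + delta)  # residual keeps the original semantics


def representation_compensation(tuned: torch.Tensor,
                                frozen: torch.Tensor,
                                alpha: float = 0.5) -> torch.Tensor:
    """Sketch: blend fine-tuned CLIP-V features with the frozen originals so
    CLIP's zero-shot behaviour is not lost during fine-tuning."""
    return alpha * tuned + (1.0 - alpha) * frozen
```

On this reading, the text side is optimized through a small added module while the vision side is regularized toward its pre-trained state, which is one plausible interpretation of "collaborative" vision-text optimization.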
Related papers
- Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks; a rough sketch of the projection idea follows this entry.
arXiv Detail & Related papers (2024-07-24T03:45:35Z)
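The summary says only that SSP relies on training-free matrix calculations. Below is the standard closed-form subspace projection such methods build on; whether SSP uses exactly this projector, and how the basis vectors are chosen, are assumptions on my part.

```python
import torch

def project_onto_subspace(feats: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Project features onto span(basis) with the closed-form projector
    P = B (B^T B)^{-1} B^T -- a single training-free matrix calculation.

    feats: (N, D) image or text embeddings.
    basis: (K, D) rows spanning the (assumed) class-relevant subspace.
    """
    B = basis.T                                   # (D, K)
    proj = B @ torch.linalg.solve(B.T @ B, B.T)   # (D, D) projection matrix
    return feats @ proj                           # projector is symmetric
```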
- SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference [11.453253140479166]
We enhance contrastive language-image pretraining's potential for semantic segmentation.
By rethinking self-attention, we find that CLIP can adapt to dense prediction tasks.
We replace the traditional self-attention block in the last layer of the CLIP vision encoder with our CSA module; a simplified sketch follows this entry.
arXiv Detail & Related papers (2023-12-04T03:18:46Z)
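SCLIP's CSA module replaces the usual query-key attention with query-query and key-key correlations; the single-head sketch below is my simplified rendering of that idea, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def correlative_self_attention(q: torch.Tensor, k: torch.Tensor,
                               v: torch.Tensor, scale: float) -> torch.Tensor:
    """q, k, v: (B, N, D) projections from the last CLIP-V layer.
    Attention comes from q*q^T and k*k^T correlations rather than q*k^T,
    so each token attends to semantically similar locations, which is
    what dense prediction needs."""
    attn = F.softmax(q @ q.transpose(-2, -1) * scale, dim=-1) \
         + F.softmax(k @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v  # (B, N, D)
```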
- Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP [15.48717971754816]
InMaP can obtain the vision proxy within one minute on a single GPU while improving the zero-shot accuracy from 77.02% to 80.21% on ImageNet with ViT-L/14@336 pre-trained by CLIP; a rough sketch of the idea follows this entry.
arXiv Detail & Related papers (2023-10-30T17:22:02Z)
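The summary gives only the headline numbers. As I understand intra-modal proxy learning, the idea is to replace text-side class proxies with vision-side proxies estimated from unlabeled images using text-derived pseudo-labels; the temperature and weighting scheme below are assumptions.

```python
import torch
import torch.nn.functional as F

def vision_proxies(img_embs: torch.Tensor, text_proxies: torch.Tensor,
                   tau: float = 0.01) -> torch.Tensor:
    """img_embs: (N, D) L2-normalized embeddings of unlabeled images.
    text_proxies: (K, D) L2-normalized class-name embeddings.
    Each class's vision proxy is the pseudo-label-weighted average of
    image embeddings, moving the proxy into the visual modality."""
    pseudo = F.softmax(img_embs @ text_proxies.T / tau, dim=-1)  # (N, K)
    proxies = pseudo.T @ img_embs                                # (K, D)
    return F.normalize(proxies, dim=-1)
```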
- Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z)
- ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation [26.25673603166731]
Recent work shows that transferring the knowledge from CLIP to semantic segmentation via prompt learning can achieve promising performance.
We focus on improving the quality of vision-text alignment from two aspects: prompting design and loss function.
We propose an align-guided contrastive loss to refine the alignment of vision and text embeddings.
arXiv Detail & Related papers (2023-08-14T11:21:47Z)
- CREPE: Learnable Prompting With CLIP Improves Visual Relationship Prediction [30.921126445357118]
We explore the potential of Vision-Language Models (VLMs), specifically CLIP, in predicting visual object relationships.
Current state-of-the-art methods use complex graphical models that utilize language cues and visual features to address this challenge.
We adopt the UVTransE relation prediction framework, which learns the relation as a translational embedding with subject, object, and union box embeddings from a scene; a rough sketch follows this entry.
arXiv Detail & Related papers (2023-07-10T18:15:03Z)
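UVTransE's translational embedding treats the predicate as what the union-box feature carries beyond the subject and object. A rough rendering of that reading; the scoring head and the normalization are my assumptions.

```python
import torch
import torch.nn.functional as F

def relation_logits(subj: torch.Tensor, obj: torch.Tensor,
                    union: torch.Tensor, predicates: torch.Tensor) -> torch.Tensor:
    """subj, obj, union: (B, D) box embeddings from a scene.
    predicates: (P, D) learnable predicate prototypes.
    Translational reading: predicate ~ union - subject - object."""
    rel = F.normalize(union - subj - obj, dim=-1)   # (B, D) translational embedding
    return rel @ F.normalize(predicates, dim=-1).T  # (B, P) similarity logits
```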
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance; a sketch of the alignment idea follows this entry.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
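Fine-grained word-region alignment is typically scored by matching each word to its best region and aggregating over words (a FILIP-style similarity). Whether DetCLIPv2 uses exactly this aggregation is an assumption; the sketch shows the general mechanism.

```python
import torch

def word_region_score(word_embs: torch.Tensor, region_embs: torch.Tensor) -> torch.Tensor:
    """word_embs: (W, D) and region_embs: (R, D), both L2-normalized.
    Each word is aligned with its best-matching region proposal, and the
    image-text pair is scored by the mean over words."""
    sim = word_embs @ region_embs.T        # (W, R) word-to-region similarities
    return sim.max(dim=-1).values.mean()   # scalar alignment score
```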
- Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting [111.49781716597984]
We propose a multimodal prompt learning scheme that balances the supervised and zero-shot performance under a single unified training.
We can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.
arXiv Detail & Related papers (2023-04-06T18:00:04Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing; a sketch of the text-to-pixel idea follows this entry.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
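Text-to-pixel alignment can be read as a per-pixel objective between the sentence embedding and decoder pixel features: pixels of the referred object are pulled toward the text, all others pushed away. The BCE form below is a common rendering of such a loss, not necessarily CRIS's exact formulation.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats: torch.Tensor, text_emb: torch.Tensor,
                       gt_mask: torch.Tensor) -> torch.Tensor:
    """pixel_feats: (B, D, H, W) from the vision-language decoder.
    text_emb: (B, D) sentence embedding; gt_mask: (B, H, W) binary mask
    of the referred object."""
    logits = torch.einsum("bdhw,bd->bhw", pixel_feats, text_emb)  # (B, H, W)
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float())
```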
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.