ICPC: Instance-Conditioned Prompting with Contrastive Learning for
Semantic Segmentation
- URL: http://arxiv.org/abs/2308.07078v1
- Date: Mon, 14 Aug 2023 11:21:47 GMT
- Title: ICPC: Instance-Conditioned Prompting with Contrastive Learning for
Semantic Segmentation
- Authors: Chaohui Yu, Qiang Zhou, Zhibin Wang, Fan Wang
- Abstract summary: Recent work shows that transferring the knowledge from CLIP to semantic segmentation via prompt learning can achieve promising performance.
We focus on improving the quality of vision-text alignment from two aspects: prompting design and the loss function.
We propose an align-guided contrastive loss to refine the alignment of vision and text embeddings.
- Score: 26.25673603166731
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern supervised semantic segmentation methods are usually finetuned from
supervised or self-supervised models pre-trained on ImageNet. Recent
work shows that transferring the knowledge from CLIP to semantic segmentation
via prompt learning can achieve promising performance. The performance boost
comes from the feature enhancement with multimodal alignment, i.e., the dot
product between vision and text embeddings. However, how to improve the
multimodal alignment for better transfer performance in dense tasks remains
underexplored. In this work, we focus on improving the quality of vision-text
alignment from two aspects, prompting design and the loss function, and present
an instance-conditioned prompting with contrastive learning (ICPC) framework.
First, compared with the static prompt designs, we reveal that dynamic
prompting conditioned on image content can more efficiently utilize the text
encoder for complex dense tasks. Second, we propose an align-guided contrastive
loss to refine the alignment of vision and text embeddings. We further propose
lightweight multi-scale alignment for better performance. Extensive experiments
on three large-scale datasets (ADE20K, COCO-Stuff10k, and ADE20K-Full)
demonstrate that ICPC brings consistent improvements across diverse backbones.
Taking ResNet-50 as an example, ICPC outperforms the state-of-the-art
counterpart by 1.71%, 1.05%, and 1.41% mIoU on the three datasets,
respectively.
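Neither the abstract nor this listing includes code, so the following PyTorch-style sketch only illustrates the two ingredients named above: a prompt conditioned on the image instance, and a dot-product vision-text alignment refined by a pixel-to-text contrastive loss. All module names, tensor shapes, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InstanceConditionedPrompt(nn.Module):
    """Hypothetical sketch: build per-image ("instance-conditioned") prompt tokens
    by adding a projection of the pooled image feature to learnable context
    vectors, instead of reusing one static prompt for every image."""

    def __init__(self, num_ctx: int, embed_dim: int, img_dim: int):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_ctx, embed_dim) * 0.02)  # learnable context
        self.img_proj = nn.Linear(img_dim, embed_dim)                    # image -> prompt space

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, img_dim) pooled image embedding from the vision encoder
        bias = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, embed_dim)
        return self.ctx.unsqueeze(0) + bias           # (B, num_ctx, embed_dim)


def alignment_map(pixel_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Dense vision-text alignment as a dot product of normalized embeddings.
    pixel_emb: (B, D, H, W) per-pixel embeddings; text_emb: (K, D) class text
    embeddings. Returns a (B, K, H, W) score map used to enhance the features."""
    pixel_emb = F.normalize(pixel_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    return torch.einsum("bdhw,kd->bkhw", pixel_emb, text_emb)


def contrastive_alignment_loss(pixel_emb, text_emb, labels, tau: float = 0.07):
    """Illustrative pixel-to-text contrastive loss: each labeled pixel is pulled
    toward its class text embedding and pushed away from the others via a
    cross-entropy over the temperature-scaled alignment map. ICPC's exact
    align-guided weighting and multi-scale variant are not reproduced here."""
    logits = alignment_map(pixel_emb, text_emb) / tau   # (B, K, H, W)
    return F.cross_entropy(logits, labels, ignore_index=255)
```

In the paper's framing, the dot-product score map serves as feature enhancement for the segmentation head; that fusion step is omitted in this sketch.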
Related papers
- Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation [82.95830628372845]
This paper introduces a collaborative vision-text optimizing mechanism within the Open-Vocabulary Segmentation (OVS) field.
To the best of our knowledge, we are the first to establish the collaborative vision-text optimizing mechanism within the OVS field.
In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4, and +1.1 mIoU across five benchmarks, respectively.
arXiv Detail & Related papers (2024-08-01T17:48:08Z) - IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z) - CLIP Brings Better Features to Visual Aesthetics Learners [12.0962117940694]
Image aesthetics assessment (IAA) is one of the ideal application scenarios for such methods due to its subjective and expensive labeling procedure.
In this work, a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation paradigm is proposed, namely CSKD.
arXiv Detail & Related papers (2023-07-28T16:00:21Z) - Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z) - COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance to the latest single-stream methods while being 10,800x faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art performance on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms prior state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower). A minimal sketch of the pixel-level loss appears below.
arXiv Detail & Related papers (2020-11-18T08:42:32Z)
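The DenseCL summary above names only the idea, so here is a minimal sketch, under simplifying assumptions, of a pixel-level contrastive loss between two augmented views. The function name, the fixed same-location positive pairing, and the temperature are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def dense_contrastive_loss(feat_q: torch.Tensor, feat_k: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """Simplified pixel-level contrastive loss between two augmented views.
    feat_q, feat_k: (B, D, H, W) dense projections of view 1 and view 2.
    Positives are taken at the same spatial location in both views; DenseCL's
    actual correspondence (matching pixels by feature similarity) and its
    memory-bank negatives are omitted to keep the sketch short."""
    B, D, H, W = feat_q.shape
    q = F.normalize(feat_q.flatten(2), dim=1)          # (B, D, HW)
    k = F.normalize(feat_k.flatten(2), dim=1)          # (B, D, HW)
    logits = torch.einsum("bdi,bdj->bij", q, k) / tau  # (B, HW, HW) pairwise similarities
    targets = torch.arange(H * W, device=q.device).expand(B, -1)  # positive = same location
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten())
```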