TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic
Segmentation
- URL: http://arxiv.org/abs/2304.07547v1
- Date: Sat, 15 Apr 2023 12:52:23 GMT
- Title: TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic
Segmentation
- Authors: Jingyao Li, Pengguang Chen, Shengju Qian, Jiaya Jia
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has shown great promise in pixel-level open-vocabulary learning tasks.
Existing models easily misidentify input pixels from unseen classes, thus confusing novel classes with semantically-similar ones.
We disentangle the ill-posed optimization problem into two parallel processes: one performs semantic matching individually, and the other judges reliability for improving discrimination ability.
- Score: 55.575224613422726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent success of Contrastive Language-Image Pre-training (CLIP) has shown
great promise in pixel-level open-vocabulary learning tasks. A general paradigm
utilizes CLIP's text and patch embeddings to generate semantic masks. However,
existing models easily misidentify input pixels from unseen classes, thus
confusing novel classes with semantically-similar ones. In our work, we
disentangle the ill-posed optimization problem into two parallel processes: one
performs semantic matching individually, and the other judges reliability for
improving discrimination ability. Motivated by special tokens in language
modeling that represent sentence-level embeddings, we design a trusty token
that decouples the known and novel category prediction tendency. With almost no
extra overhead, we upgrade the pixel-level generalization capacity of existing
models effectively. Our TagCLIP (CLIP adapting with Trusty-guidance) boosts the
IoU of unseen classes by 7.4% and 1.7% on PASCAL VOC 2012 and COCO-Stuff 164K, respectively.
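The two parallel processes described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sigmoid-with-temperature scoring, and the choice to combine the two outputs by multiplication are all assumptions made for illustration. Each patch embedding is matched against each class text embedding independently, and an optional per-patch, per-class reliability score (a hypothetical stand-in for what the trusty token would supply) rescales the matching scores before the class is picked.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def match_scores(patch_embs, text_embs, temperature=0.1):
    """Score every (patch, class) pair independently.

    A sigmoid is used instead of a softmax so that classes do not
    compete with each other (an assumption, not taken from the paper).
    """
    return [
        [sigmoid(cosine(p, t) / temperature) for t in text_embs]
        for p in patch_embs
    ]


def predict(patch_embs, text_embs, reliability=None):
    """Return one class index per patch.

    `reliability` is a hypothetical per-patch, per-class weight in [0, 1];
    multiplying it into the matching scores is an illustrative choice.
    """
    scores = match_scores(patch_embs, text_embs)
    if reliability is not None:
        scores = [
            [s * r for s, r in zip(row, rel)]
            for row, rel in zip(scores, reliability)
        ]
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy data: two patches, two classes, 2-D embeddings.
patches = [[1.0, 0.0], [0.6, 0.8]]
texts = [[1.0, 0.0], [0.0, 1.0]]

plain = predict(patches, texts)
guided = predict(patches, texts, reliability=[[1.0, 1.0], [1.0, 0.2]])
print(plain, guided)
```

In the toy data the second patch is slightly closer to class 1, so plain matching assigns it there; a low reliability for that pair flips the decision to class 0, which is the kind of correction the reliability branch is meant to provide.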
Related papers
- Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [33.69721994194684]
We propose a novel framework for open-vocabulary semantic segmentation called EBSeg.
AdaB Decoder is designed to generate different image embeddings for both training and new classes.
SSC Loss aligns the inter-class affinity in the image feature space with that in the text feature space of CLIP.
arXiv Detail & Related papers (2024-06-14T08:34:20Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- CLIP-S$^4$: Language-Guided Self-Supervised Semantic Segmentation [15.29479338808226]
We present CLIP-S$^4$ that leverages self-supervised pixel representation learning and vision-language models to enable various semantic segmentation tasks.
Our approach shows consistent and substantial performance improvement over four popular benchmarks.
arXiv Detail & Related papers (2023-05-01T19:01:01Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- ProtoCLIP: Prototypical Contrastive Language Image Pretraining [12.067061175987075]
Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping.
ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data.
arXiv Detail & Related papers (2022-06-22T11:55:53Z)
- OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression [94.28253749970534]
We propose to learn the rank concepts from the rich semantic CLIP latent space.
OrdinalCLIP consists of learnable context tokens and learnable rank embeddings.
Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks.
arXiv Detail & Related papers (2022-06-06T03:54:53Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.