Related papers: TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

URL: http://arxiv.org/abs/2304.07547v1
Date: Sat, 15 Apr 2023 12:52:23 GMT
Title: TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation
Authors: Jingyao Li, Pengguang Chen, Shengju Qian, Jiaya Jia
Abstract summary: Contrastive Language-Image Pre-training(CLIP) has shown great promise in pixel-level open-vocabulary learning tasks. Existing models easily misidentify input pixels from unseen classes, thus confusing novel classes with semantically-similar ones. We disentangle the ill-posed optimization problem into two parallel processes: one performs semantic matching individually, and the other judges reliability for improving discrimination ability.
Score: 55.575224613422726
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent success of Contrastive Language-Image Pre-training~(CLIP) has shown great promise in pixel-level open-vocabulary learning tasks. A general paradigm utilizes CLIP's text and patch embeddings to generate semantic masks. However, existing models easily misidentify input pixels from unseen classes, thus confusing novel classes with semantically-similar ones. In our work, we disentangle the ill-posed optimization problem into two parallel processes: one performs semantic matching individually, and the other judges reliability for improving discrimination ability. Motivated by special tokens in language modeling that represents sentence-level embeddings, we design a trusty token that decouples the known and novel category prediction tendency. With almost no extra overhead, we upgrade the pixel-level generalization capacity of existing models effectively. Our TagCLIP (CLIP adapting with Trusty-guidance) boosts the IoU of unseen classes by 7.4% and 1.7% on PASCAL VOC 2012 and COCO-Stuff 164K.

Related papers

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection [54.21851618853518]
We present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks.
arXiv Detail & Related papers (2025-03-21T12:10:38Z)
Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding. To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm. PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training [29.431698321195814]
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. CLIP shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class. We propose a local-to-global framework to obtain image tags.
arXiv Detail & Related papers (2023-12-20T08:15:40Z)
CLIP-S$^4$: Language-Guided Self-Supervised Semantic Segmentation [15.29479338808226]
We present CLIP-S$4$ that leverages self-supervised pixel representation learning and vision-language models to enable various semantic segmentation tasks. Our approach shows consistent and substantial performance improvement over four popular benchmarks.
arXiv Detail & Related papers (2023-05-01T19:01:01Z)
CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z)
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation. The main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs. Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z)
Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP) We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression [94.28253749970534]
We propose to learn the rank concepts from the rich semantic CLIP latent space. OrdinalCLIP consists of learnable context tokens and learnable rank embeddings. Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks.
arXiv Detail & Related papers (2022-06-06T03:54:53Z)
A Multi-level Supervised Contrastive Learning Framework for Low-Resource Natural Language Inference [54.678516076366506]
Natural Language Inference (NLI) is a growingly essential task in natural language understanding. Here we propose a multi-level supervised contrastive learning framework named MultiSCL for low-resource natural language inference.
arXiv Detail & Related papers (2022-05-31T05:54:18Z)
DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins. Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.