CLIP Is Also a Good Teacher: A New Learning Framework for Inductive
Zero-shot Semantic Segmentation
- URL: http://arxiv.org/abs/2310.02296v2
- Date: Wed, 21 Feb 2024 12:31:03 GMT
- Title: CLIP Is Also a Good Teacher: A New Learning Framework for Inductive
Zero-shot Semantic Segmentation
- Authors: Jialei Chen, Daisuke Deguchi, Chenkai Zhang, Xu Zheng, Hiroshi Murase
- Abstract summary: Generalized Zero-shot Semantic Segmentation aims to segment both seen and unseen categories only under the supervision of the seen ones.
Existing methods adopt large-scale Vision Language Models (VLMs), which obtain outstanding zero-shot performance.
We propose CLIP-ZSS (Zero-shot Semantic Segmentation), a training framework that enables any image encoder designed for closed-set segmentation to be applied to zero-shot and open-vocabulary tasks.
- Score: 6.181169909576527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalized Zero-shot Semantic Segmentation aims to segment both
seen and unseen categories only under the supervision of the seen ones. To
tackle this, existing methods adopt large-scale Vision Language Models (VLMs),
which obtain outstanding zero-shot performance. However, as the VLMs are
designed for classification tasks, directly adapting them may lead to
sub-optimal performance. Consequently, we propose CLIP-ZSS (Zero-shot Semantic
Segmentation), a simple but effective training framework that enables any image
encoder designed for closed-set segmentation to be applied to zero-shot and
open-vocabulary tasks at test time, without combining it with VLMs or
inserting new modules. CLIP-ZSS consists of two key modules: the Global
Learning Module (GLM) and the Pixel Learning Module (PLM). GLM probes the
knowledge of the CLIP visual encoder by pulling together the CLIP [CLS] token
and the dense features that the image encoder produces for the same image,
while pushing apart those of different images (sketched below). Moreover, to
enhance the ability to discriminate unseen categories, PLM, consisting of
pseudo-label and pseudo-weight generation, is designed. To generate
semantically discriminative pseudo labels, a multi-scale K-Means with mask
fusion operating on the dense tokens is proposed (also sketched below). For
pseudo-weight generation, a synthesizer that generates pseudo semantic
features for the unannotated areas is introduced. Experiments on three
benchmarks show large performance gains over state-of-the-art methods.
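
The GLM objective described above amounts to a contrastive alignment between
CLIP's global [CLS] embedding and the dense features of the trainable image
encoder. Below is a minimal PyTorch sketch of what such a loss could look
like; the function name, the mean pooling, the temperature value, and the
symmetric cross-entropy form are illustrative assumptions, not the paper's
exact formulation.

```python
import torch
import torch.nn.functional as F

def glm_contrastive_loss(clip_cls, dense_tokens, temperature=0.07):
    """InfoNCE-style objective: pull each image's frozen CLIP [CLS]
    embedding toward the pooled dense features of the trainable image
    encoder, and push apart pairs that come from different images.

    clip_cls:     (B, D) [CLS] tokens from the frozen CLIP visual encoder
    dense_tokens: (B, N, D) dense features from the image encoder
    """
    pooled = dense_tokens.mean(dim=1)              # (B, D) global pooling
    z_clip = F.normalize(clip_cls, dim=-1)
    z_img = F.normalize(pooled, dim=-1)
    logits = z_clip @ z_img.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_clip.size(0), device=z_clip.device)
    # Matching image pairs sit on the diagonal; symmetric cross-entropy
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

If CLIP is kept frozen as the teacher, gradients flow only into the image
encoder, which is what would let a closed-set backbone absorb CLIP's global
semantics without any extra modules at test time.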
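The multi-scale K-Means with mask fusion can be pictured as clustering the
dense tokens at several granularities and merging heavily overlapping masks
into single pseudo regions. The sketch below assumes vanilla K-Means, a fixed
set of scales, and IoU-based union merging, all stand-ins for the paper's
actual procedure.

```python
import torch

def kmeans_labels(tokens, k, iters=10):
    """Plain K-Means over (N, D) dense tokens; returns (N,) cluster ids."""
    centers = tokens[torch.randperm(tokens.size(0))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centers).argmin(dim=1)
        for c in range(k):
            members = assign == c
            if members.any():
                centers[c] = tokens[members].mean(dim=0)
    return assign

def multiscale_pseudo_masks(tokens, scales=(4, 8, 16), iou_thresh=0.5):
    """Cluster dense tokens at several K values, then fuse masks from
    different scales whose IoU exceeds a threshold (union merge here,
    purely for illustration)."""
    masks = []
    for k in scales:
        assign = kmeans_labels(tokens, k)
        masks += [assign == c for c in range(k) if (assign == c).any()]
    fused, used = [], [False] * len(masks)
    for i, m in enumerate(masks):
        if used[i]:
            continue
        merged = m.clone()
        for j in range(i + 1, len(masks)):
            if used[j]:
                continue
            inter = (merged & masks[j]).sum().item()
            union = (merged | masks[j]).sum().item()
            if union and inter / union > iou_thresh:
                merged |= masks[j]
                used[j] = True
        fused.append(merged)
    return fused  # list of (N,) boolean pseudo-label masks
```

Running K-Means at several values of K trades off over- and
under-segmentation; the fusion step then collapses redundant masks so each
pseudo region is counted once.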
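For pseudo-weight generation, one plausible reading of the abstract is a small
network that turns a pooled prototype of an unannotated pseudo region into a
classifier weight (a "pseudo semantic feature") for that region. The class
below is a hypothetical sketch along those lines; the two-layer MLP, its
width, and the mean-pooled prototype are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class PseudoWeightSynthesizer(nn.Module):
    """Hypothetical synthesizer: maps the mean-pooled prototype of an
    unannotated pseudo-label region to a pseudo classifier weight."""

    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, dense_tokens, region_mask):
        # dense_tokens: (N, D) dense features; region_mask: (N,) bool mask
        prototype = dense_tokens[region_mask].mean(dim=0)  # (D,) prototype
        return self.mlp(prototype)                         # (D,) pseudo weight
```

Such pseudo weights could then score pixels in the unannotated areas alongside
the seen-class weights, which is one way to train discrimination of unseen
categories under seen-only supervision.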
Related papers
- PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model [49.80313655590392]
PSALM is a powerful extension of the Large Multi-modal Model (LMM) that addresses the challenges of segmentation tasks.
It incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks.
The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization.
arXiv Detail & Related papers (2024-03-21T17:50:47Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation [36.41778553250247]
Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation models using image data with only image-level supervision.
We propose a Semantic Prompt Learning for WSSS (SemPLeS) framework, which learns to effectively prompt the CLIP latent space.
SemPLeS can perform better semantic alignment between object regions and the associated class labels.
arXiv Detail & Related papers (2024-01-22T09:41:05Z)
- TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training [29.431698321195814]
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification.
CLIP shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class.
We propose a local-to-global framework to obtain image tags.
arXiv Detail & Related papers (2023-12-20T08:15:40Z)
- Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation [17.914290294935427]
Traditional 3D segmentation methods can only recognize a fixed range of classes that appear in the training set.
Large-scale visual-language pre-trained models, such as CLIP, have shown their generalization ability in the zero-shot 2D vision tasks.
We propose a simple yet effective baseline to transfer the visual-linguistic knowledge implied in CLIP to a point cloud encoder.
arXiv Detail & Related papers (2023-12-12T12:35:59Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted via contrastive prompt-tuning.
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Waffling around for Performance: Visual Classification with Random Words and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels.
To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.