Open-Vocabulary Universal Image Segmentation with MaskCLIP
- URL: http://arxiv.org/abs/2208.08984v2
- Date: Thu, 8 Jun 2023 06:35:33 GMT
- Title: Open-Vocabulary Universal Image Segmentation with MaskCLIP
- Authors: Zheng Ding, Jieke Wang, Zhuowen Tu
- Abstract summary: We tackle an emerging computer vision task, open-vocabulary universal image segmentation.
We first build a baseline method by directly adopting pre-trained CLIP models.
We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder.
- Score: 24.74805434602145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we tackle an emerging computer vision task, open-vocabulary
universal image segmentation, that aims to perform semantic/instance/panoptic
segmentation (background semantic labeling + foreground instance segmentation)
for arbitrary categories of text-based descriptions at inference time. We first
build a baseline method by directly adopting pre-trained CLIP models without
finetuning or distillation. We then develop MaskCLIP, a Transformer-based
approach with a MaskCLIP Visual Encoder, which is an encoder-only module that
seamlessly integrates mask tokens with a pre-trained ViT CLIP model for
semantic/instance segmentation and class prediction. MaskCLIP learns to
efficiently and effectively utilize pre-trained partial/dense CLIP features
within the MaskCLIP Visual Encoder that avoids the time-consuming
student-teacher training process. MaskCLIP outperforms previous methods for
semantic/instance/panoptic segmentation on ADE20K and PASCAL datasets. We show
qualitative illustrations for MaskCLIP with online custom categories. Project
website: https://maskclip.github.io.
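The abstract describes class prediction for arbitrary, user-supplied category names at inference time. The following is a minimal sketch of that open-vocabulary classification step, not the authors' implementation: per-mask visual embeddings (here `mask_embeddings`, a random stand-in for features such as MaskCLIP's mask tokens) are matched against CLIP text embeddings of custom category names by cosine similarity. The prompt template and all variable names are illustrative assumptions.

```python
# Hedged sketch of open-vocabulary class prediction with a pre-trained CLIP model.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Arbitrary text-based categories supplied at inference time ("online custom categories").
category_names = ["zebra", "picnic table", "hot air balloon"]
prompts = clip.tokenize([f"a photo of a {c}" for c in category_names]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(prompts).float()          # (C, D)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Hypothetical per-mask visual embeddings, standing in for features pooled
# from the CLIP visual encoder over predicted mask regions.
num_masks, dim = 5, text_emb.shape[-1]
mask_embeddings = torch.randn(num_masks, dim, device=device)
mask_embeddings = mask_embeddings / mask_embeddings.norm(dim=-1, keepdim=True)

# Cosine similarity -> per-mask class probabilities over the open vocabulary.
logits = 100.0 * mask_embeddings @ text_emb.T               # (num_masks, C)
probs = logits.softmax(dim=-1)
pred = [category_names[i] for i in probs.argmax(dim=-1).tolist()]
print(pred)
```

With real mask features in place of the random stand-in, the same similarity step assigns each predicted mask a label from whatever vocabulary is typed in at test time, which is what makes the segmenter "open-vocabulary".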
Related papers
- Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
- MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment [53.235290505274676]
Large-scale vision-language models such as CLIP can improve semantic segmentation performance.
We introduce MTA-CLIP, a novel framework employing mask-level vision-language alignment.
MTA-CLIP achieves state-of-the-art performance, surpassing prior works by an average of 2.8% and 1.3% on benchmark datasets.
arXiv Detail & Related papers (2024-07-31T14:56:42Z)
- Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation.
Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction.
A new decoder is designed to interpret extracted semantic features for final prediction.
arXiv Detail & Related papers (2024-06-17T03:49:47Z)
- CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation [44.450243388665776]
We propose a simple encoder-decoder network, called CLIP-VIS, to adapt CLIP for open-vocabulary video instance segmentation.
Our CLIP-VIS adopts frozen CLIP and introduces three modules, including class-agnostic mask generation, temporal topK-enhanced matching, and weighted open-vocabulary classification.
arXiv Detail & Related papers (2024-03-19T05:27:04Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- Learning Mask-aware CLIP Representations for Zero-Shot Segmentation [120.97144647340588]
An Image-Proposals CLIP Encoder (IP-CLIP) is proposed to handle arbitrary numbers of image and mask proposals simultaneously.
A mask-aware loss and a self-distillation loss are designed to fine-tune IP-CLIP, ensuring CLIP is responsive to different mask proposals.
We conduct extensive experiments on the popular zero-shot benchmarks.
arXiv Detail & Related papers (2023-09-30T03:27:31Z)
- Side Adapter Network for Open-Vocabulary Semantic Segmentation [69.18441687386733]
This paper presents a new framework for open-vocabulary semantic segmentation with a pre-trained vision-language model, named Side Adapter Network (SAN).
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
arXiv Detail & Related papers (2023-02-23T18:58:28Z)
- CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels.
To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z)
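Several entries above (notably the Frozen CLIP/WeCLIP and Side Adapter Network summaries) share the pattern of keeping the CLIP backbone frozen and training only a small module on top. The sketch below illustrates that generic pattern under stated assumptions; it is not any of these papers' actual architectures. `dense_feats` is a hypothetical stand-in for dense patch-token features, since extracting them from the ViT requires a forward hook that is omitted here, and `SegDecoder` is an invented toy head.

```python
# Hedged sketch of the frozen-CLIP-backbone pattern: freeze CLIP, train a small decoder.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Freeze every CLIP parameter; only the decoder below receives gradients.
for p in model.parameters():
    p.requires_grad_(False)

class SegDecoder(nn.Module):
    """Tiny trainable decoder mapping dense CLIP features to per-class logits."""
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Low-resolution segmentation logits; upsampling to image size is omitted.
        return self.head(feats)

decoder = SegDecoder(in_dim=768, num_classes=21).to(device)   # 768 = ViT-B feature width
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)  # trains only the decoder

# Hypothetical dense features on a 14x14 patch grid (224 / 16 for ViT-B/16).
dense_feats = torch.randn(2, 768, 14, 14, device=device)
logits = decoder(dense_feats)
print(logits.shape)  # torch.Size([2, 21, 14, 14])
```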