Related papers: Open-Vocabulary Universal Image Segmentation with MaskCLIP

Open-Vocabulary Universal Image Segmentation with MaskCLIP

URL: http://arxiv.org/abs/2208.08984v2
Date: Thu, 8 Jun 2023 06:35:33 GMT
Title: Open-Vocabulary Universal Image Segmentation with MaskCLIP
Authors: Zheng Ding, Jieke Wang, Zhuowen Tu
Abstract summary: We tackle an emerging computer vision task, open-vocabulary universal image segmentation. We first build a baseline method by directly adopting pre-trained CLIP models. We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual.
Score: 24.74805434602145
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we tackle an emerging computer vision task, open-vocabulary universal image segmentation, that aims to perform semantic/instance/panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories of text-based descriptions in inference time. We first build a baseline method by directly adopting pre-trained CLIP models without finetuning or distillation. We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder, which is an encoder-only module that seamlessly integrates mask tokens with a pre-trained ViT CLIP model for semantic/instance segmentation and class prediction. MaskCLIP learns to efficiently and effectively utilize pre-trained partial/dense CLIP features within the MaskCLIP Visual Encoder that avoids the time-consuming student-teacher training process. MaskCLIP outperforms previous methods for semantic/instance/panoptic segmentation on ADE20K and PASCAL datasets. We show qualitative illustrations for MaskCLIP with online custom categories. Project website: https://maskclip.github.io.

Related papers

High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation [109.19165503929992]
We present MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. After low-cost fine-tuning, MaskCLIP++ significantly improves the mask classification performance on multi-domain datasets. We achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets.
arXiv Detail & Related papers (2024-12-16T05:44:45Z)
Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding. To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm. PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment [53.235290505274676]
Large-scale vision-language models such as CLIP can improve semantic segmentation performance. We introduce MTA-CLIP, a novel framework employing mask-level vision-language alignment. MTA-CLIP achieves state-of-the-art, surpassing prior works by an average of 2.8% and 1.3% on benchmark datasets.
arXiv Detail & Related papers (2024-07-31T14:56:42Z)
Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation. Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction. A new decoder is designed to interpret extracted semantic features for final prediction.
arXiv Detail & Related papers (2024-06-17T03:49:47Z)
CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation [44.450243388665776]
We propose a simple encoder-decoder network, called CLIP-VIS, to adapt CLIP for open-vocabulary video instance segmentation. Our CLIP-VIS adopts frozen CLIP and introduces three modules, including class-agnostic mask generation, temporal topK-enhanced matching, and weighted open-vocabulary classification.
arXiv Detail & Related papers (2024-03-19T05:27:04Z)
PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework. We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
Learning Mask-aware CLIP Representations for Zero-Shot Segmentation [120.97144647340588]
Mask-awareProposals CLIP (IP-CLIP) is proposed to handle arbitrary numbers of image and mask proposals simultaneously. mask-aware loss and self-distillation loss are designed to fine-tune IP-CLIP, ensuring CLIP is responsive to different mask proposals. We conduct extensive experiments on the popular zero-shot benchmarks.
arXiv Detail & Related papers (2023-09-30T03:27:31Z)
Side Adapter Network for Open-Vocabulary Semantic Segmentation [69.18441687386733]
This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN) A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias. Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
arXiv Detail & Related papers (2023-02-23T18:58:28Z)
CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.