Learning Mask-aware CLIP Representations for Zero-Shot Segmentation
- URL: http://arxiv.org/abs/2310.00240v1
- Date: Sat, 30 Sep 2023 03:27:31 GMT
- Title: Learning Mask-aware CLIP Representations for Zero-Shot Segmentation
- Authors: Siyu Jiao, Yunchao Wei, Yaowei Wang, Yao Zhao, Humphrey Shi
- Abstract summary: Mask-awareProposals CLIP (IP-CLIP) is proposed to handle arbitrary numbers of image and mask proposals simultaneously.
mask-aware loss and self-distillation loss are designed to fine-tune IP-CLIP, ensuring CLIP is responsive to different mask proposals.
We conduct extensive experiments on the popular zero-shot benchmarks.
- Score: 120.97144647340588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, pre-trained vision-language models have been increasingly used to
tackle the challenging zero-shot segmentation task. Typical solutions follow
the paradigm of first generating mask proposals and then adopting CLIP to
classify them. To maintain CLIP's zero-shot transferability, previous
practices favour freezing CLIP during training. However, in this paper, we
reveal that CLIP is insensitive to different mask proposals and tends to
produce similar predictions for various mask proposals of the same image. This
insensitivity results in numerous false positives when classifying mask
proposals. This issue mainly relates to the fact that CLIP is trained with
image-level supervision. To alleviate this issue, we propose a simple yet
effective method, named Mask-aware Fine-tuning (MAFT). Specifically,
Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary
numbers of image and mask proposals simultaneously. Then, mask-aware loss and
self-distillation loss are designed to fine-tune IP-CLIP Encoder, ensuring CLIP
is responsive to different mask proposals while not sacrificing
transferability. In this way, mask-aware representations can be easily learned
to make the true positives stand out. Notably, our solution can seamlessly plug
into most existing methods without introducing any new parameters during the
fine-tuning process. We conduct extensive experiments on the popular zero-shot
benchmarks. With MAFT, the performance of state-of-the-art methods is
improved by a large margin: 50.4% (+8.2%) on COCO, 81.8% (+3.2%) on
Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. The
code is available at https://github.com/jiaosiyu1999/MAFT.git.
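The abstract only names the components of MAFT; the PyTorch-style sketch below illustrates the underlying idea, not the official implementation. It assumes dense visual features from a CLIP image encoder, binary mask proposals, and CLIP text embeddings; the helper names (mask_pool, proposal_logits), the IoU-based target construction, and the specific loss forms are hypothetical choices made for illustration.

```python
# Minimal sketch of mask-aware fine-tuning -- NOT the official MAFT code.
# Assumed inputs:
#   clip_feats   (B, C, H, W)  dense features from a CLIP image encoder
#   masks        (B, N, H, W)  binary mask proposals
#   text_embeds  (K, C)        CLIP text embeddings for K class prompts
import torch
import torch.nn.functional as F


def mask_pool(clip_feats, masks, eps=1e-6):
    """Average-pool dense CLIP features inside each mask proposal."""
    masks = masks.float()                                    # (B, N, H, W)
    area = masks.sum(dim=(-2, -1)).clamp(min=eps)            # (B, N)
    feats = torch.einsum('bchw,bnhw->bnc', clip_feats, masks)
    return feats / area.unsqueeze(-1)                        # (B, N, C)


def proposal_logits(clip_feats, masks, text_embeds, tau=0.07):
    """Cosine-similarity logits between proposal embeddings and class texts."""
    v = F.normalize(mask_pool(clip_feats, masks), dim=-1)    # (B, N, C)
    t = F.normalize(text_embeds, dim=-1)                     # (K, C)
    return torch.einsum('bnc,kc->bnk', v, t) / tau           # (B, N, K)


def mask_aware_loss(logits, proposal_iou):
    """Make class scores track proposal quality.

    proposal_iou (B, N, K): IoU of each proposal with the ground-truth mask
    of each class (0 where the class is absent) -- a hypothetical target.
    """
    scores = logits.softmax(dim=-1)
    return F.smooth_l1_loss(scores, proposal_iou)


def self_distill_loss(logits_ft, logits_frozen, T=2.0):
    """Keep fine-tuned predictions close to those of the frozen CLIP."""
    p_frozen = (logits_frozen / T).softmax(dim=-1)
    log_p_ft = (logits_ft / T).log_softmax(dim=-1)
    return F.kl_div(log_p_ft, p_frozen, reduction='batchmean') * T * T
```

A training step under these assumptions would combine the two terms, e.g. `loss = mask_aware_loss(logits_ft, iou_targets) + lam * self_distill_loss(logits_ft, logits_frozen)`, and update only the CLIP image encoder, which is consistent with the abstract's claim that no new parameters are introduced during fine-tuning.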
Related papers
- Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation [63.13635858586001]
Referring Image Segmentation (RIS) is the problem of identifying objects in images through natural language sentences.
We propose a novel weakly-supervised framework that tackles RIS by decomposing it into three steps.
Using only the first two steps (zero-shot segment and select) outperforms other zero-shot baselines by as much as 16.5%.
arXiv Detail & Related papers (2023-10-20T13:20:17Z) - Class-Incremental Exemplar Compression for Class-Incremental Learning [90.93462714376078]
We propose an adaptive mask generation model called class-incremental masking (CIM).
We conduct experiments on high-resolution CIL benchmarks including Food-101, ImageNet-100, and ImageNet-1000.
We show that using the compressed exemplars by CIM can achieve a new state-of-the-art CIL accuracy, e.g., 4.8 percentage points higher than FOSTER on 10-Phase ImageNet-1000.
arXiv Detail & Related papers (2023-03-24T14:51:20Z) - MP-Former: Mask-Piloted Transformer for Image Segmentation [16.620469868310288]
Mask2Former suffers from inconsistent mask predictions between decoder layers.
We propose a mask-piloted training approach, which feeds noised ground-truth masks into masked attention and trains the model to reconstruct the original ones.
arXiv Detail & Related papers (2023-03-13T17:57:59Z) - Side Adapter Network for Open-Vocabulary Semantic Segmentation [69.18441687386733]
This paper presents a new framework for open-vocabulary semantic segmentation with a pre-trained vision-language model, named Side Adapter Network (SAN).
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
arXiv Detail & Related papers (2023-02-23T18:58:28Z) - Attentive Mask CLIP [48.206857783966996]
We propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description.
Our approach achieves 43.9% top-1 accuracy on ImageNet-1K zero-shot classification, as well as 62.7/42.1 and 38.0/23.2 I2T/T2I retrieval accuracy.
arXiv Detail & Related papers (2022-12-16T18:59:12Z) - CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly
Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels.
To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z) - ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [35.60888272729273]
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme.
While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost.
We propose a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level.
arXiv Detail & Related papers (2022-12-07T12:05:00Z) - Open-Vocabulary Universal Image Segmentation with MaskCLIP [24.74805434602145]
We tackle an emerging computer vision task, open-vocabulary universal image segmentation.
We first build a baseline method by directly adopting pre-trained CLIP models.
We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder.
arXiv Detail & Related papers (2022-08-18T17:55:37Z) - BoxInst: High-Performance Instance Segmentation with Box Annotations [102.10713189544947]
We present a high-performance method that can achieve mask-level instance segmentation with only bounding-box annotations for training.
Our core idea is to redesign the loss of learning masks in instance segmentation, with no modification to the segmentation network itself.
arXiv Detail & Related papers (2020-12-03T22:27:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.