Side Adapter Network for Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2302.12242v2
- Date: Wed, 22 Mar 2023 09:45:35 GMT
- Title: Side Adapter Network for Open-Vocabulary Semantic Segmentation
- Authors: Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu and Xiang Bai
- Abstract summary: This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN).
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
- Score: 69.18441687386733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a new framework for open-vocabulary semantic segmentation
with the pre-trained vision-language model, named Side Adapter Network (SAN).
Our approach models the semantic segmentation task as a region recognition
problem. A side network is attached to a frozen CLIP model with two branches:
one for predicting mask proposals, and the other for predicting attention bias
which is applied in the CLIP model to recognize the class of masks. This
decoupled design makes it easier for CLIP to recognize the class of mask
proposals. Since the attached side network can reuse CLIP features, it can be
very light. In addition, the entire network can be trained end-to-end, allowing
the side network to be adapted to the frozen CLIP model, which makes the
predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only
adds a few additional trainable parameters. We evaluate our approach on
multiple semantic segmentation benchmarks. Our method significantly outperforms
other counterparts, with up to 18 times fewer trainable parameters and 19 times
faster inference speed. We hope our approach will serve as a solid baseline and
help ease future research in open-vocabulary semantic segmentation. The code
will be available at https://github.com/MendelXu/SAN.
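The decoupled design described above can be illustrated with a toy sketch: a side network emits mask proposals plus per-query attention biases, and the biases steer pooling over frozen CLIP patch features so each mask is classified against text embeddings. All shapes, names, and the random stand-ins below are hypothetical illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper operates at ViT scale).
H, W, D = 8, 8, 16   # feature-map size, embedding dimension
N, C = 4, 5          # number of mask queries, vocabulary size

# Stand-ins for frozen CLIP outputs: patch features and class text embeddings.
patch_feats = rng.standard_normal((H * W, D))
text_embeds = rng.standard_normal((C, D))

# Stand-ins for side-network outputs (normally predicted by a small attached net):
mask_logits = rng.standard_normal((N, H, W))   # one mask proposal per query
attn_bias = rng.standard_normal((N, H * W))    # per-query attention bias

# Each query pools CLIP patch tokens, with its bias steering the attention
# toward its own mask region (softmax over patch positions).
attn = np.exp(attn_bias) / np.exp(attn_bias).sum(axis=1, keepdims=True)
query_feats = attn @ patch_feats               # (N, D) pooled CLIP features

# Classify each mask proposal by cosine similarity to the text embeddings.
q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
class_logits = q @ t.T                         # (N, C) class scores per mask
```

Because recognition reuses the frozen CLIP features, only the small side network carries trainable parameters, which is the source of the parameter and speed savings quoted above.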
Related papers
- Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation.
Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction.
A new decoder is designed to interpret extracted semantic features for final prediction.
arXiv Detail & Related papers (2024-06-17T03:49:47Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- Learning Mask-aware CLIP Representations for Zero-Shot Segmentation [120.97144647340588]
An Image-Proposals CLIP encoder (IP-CLIP) is proposed to handle arbitrary numbers of image and mask proposals simultaneously.
A mask-aware loss and a self-distillation loss are designed to fine-tune IP-CLIP, ensuring CLIP is responsive to different mask proposals.
We conduct extensive experiments on the popular zero-shot benchmarks.
arXiv Detail & Related papers (2023-09-30T03:27:31Z)
- Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network [26.97153244517095]
We propose a network that only needs a single pass through the visual-language model for each input image.
We first propose a novel network adaptation approach, termed patch severance, to restrict the harmful interference between the patch embeddings in the pre-trained visual encoder.
We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification.
arXiv Detail & Related papers (2023-04-03T17:59:21Z)
- Open-Vocabulary Universal Image Segmentation with MaskCLIP [24.74805434602145]
We tackle an emerging computer vision task, open-vocabulary universal image segmentation.
We first build a baseline method by directly adopting pre-trained CLIP models.
We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder.
arXiv Detail & Related papers (2022-08-18T17:55:37Z)
- Exploiting Shape Cues for Weakly Supervised Semantic Segmentation [15.791415215216029]
Weakly supervised semantic segmentation (WSSS) aims to produce pixel-wise class predictions with only image-level labels for training.
We propose to exploit shape information to supplement the texture-biased property of convolutional neural networks (CNNs).
We further refine the predictions in an online fashion with a novel refinement method that takes into account both the class and the color affinities.
arXiv Detail & Related papers (2022-08-08T17:25:31Z)
- Per-Pixel Classification is Not All You Need for Semantic Segmentation [184.2905747595058]
Mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks.
We propose MaskFormer, a simple mask classification model which predicts a set of binary masks.
Our method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
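The mask-classification inference this entry describes (per-query binary masks combined with per-query class probabilities) can be sketched in a few lines; the shapes below are toy values and the random inputs are stand-ins for the model's predictions, not MaskFormer's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 4, 3, 8, 8  # queries, classes, spatial size (toy values)

# Stand-ins for the model's outputs: per-query class probabilities and
# per-query binary mask probabilities.
class_probs = rng.random((N, C))
class_probs /= class_probs.sum(axis=1, keepdims=True)  # normalize rows
mask_probs = rng.random((N, H, W))

# Semantic inference: marginalize over the N queries, giving a per-class
# score map, then take the per-pixel argmax.
semseg = np.einsum("nc,nhw->chw", class_probs, mask_probs)
pred = semseg.argmax(axis=0)  # (H, W) per-pixel class map
```

The same set of masks serves semantic and panoptic inference, which is why a single mask-classification model can cover both tasks.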
arXiv Detail & Related papers (2021-07-13T17:59:50Z)
- CRNet: Cross-Reference Networks for Few-Shot Segmentation [59.85183776573642]
Few-shot segmentation aims to learn a segmentation model that can be generalized to novel classes with only a few training images.
With a cross-reference mechanism, our network can better find the co-occurrent objects in the two images.
Experiments on the PASCAL VOC 2012 dataset show that our network achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-03-24T04:55:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.