Side Adapter Network for Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2302.12242v2
- Date: Wed, 22 Mar 2023 09:45:35 GMT
- Title: Side Adapter Network for Open-Vocabulary Semantic Segmentation
- Authors: Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu and Xiang Bai
- Abstract summary: This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN).
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
- Score: 69.18441687386733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a new framework for open-vocabulary semantic segmentation
with the pre-trained vision-language model, named Side Adapter Network (SAN).
Our approach models the semantic segmentation task as a region recognition
problem. A side network is attached to a frozen CLIP model with two branches:
one for predicting mask proposals, and the other for predicting attention bias
which is applied in the CLIP model to recognize the class of masks. This
decoupled design helps CLIP recognize the class of mask
proposals. Since the attached side network can reuse CLIP features, it can be
very light. In addition, the entire network can be trained end-to-end, allowing
the side network to be adapted to the frozen CLIP model, which makes the
predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only
adds a few additional trainable parameters. We evaluate our approach on
multiple semantic segmentation benchmarks. Our method significantly outperforms
other counterparts, with up to 18 times fewer trainable parameters and 19 times
faster inference speed. We hope our approach will serve as a solid baseline and
help ease future research in open-vocabulary semantic segmentation. The code
will be available at https://github.com/MendelXu/SAN.
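As an illustration of the decoupled design described above, a minimal PyTorch sketch of the side network is given below. The module names, dimensions, and the exact way CLIP features are fused and the attention bias is formed are assumptions made for readability, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch of SAN's two-branch side network; all shapes and names
# are illustrative assumptions, not the released implementation.
import torch
import torch.nn as nn

class SideAdapterSketch(nn.Module):
    def __init__(self, clip_dim=768, side_dim=240, num_queries=100, depth=8):
        super().__init__()
        # Learnable query tokens; each query becomes one mask proposal.
        self.queries = nn.Parameter(torch.randn(num_queries, side_dim))
        # A deliberately light ViT operating on the input image (the "side" network).
        self.patch_embed = nn.Conv2d(3, side_dim, kernel_size=16, stride=16)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(side_dim, nhead=8, batch_first=True)
             for _ in range(depth)]
        )
        # Projection that lets the side network reuse frozen CLIP patch features.
        self.fuse = nn.Linear(clip_dim, side_dim)
        self.mask_proj = nn.Linear(side_dim, side_dim)  # branch 1: mask proposals
        self.bias_proj = nn.Linear(side_dim, side_dim)  # branch 2: attention bias

    def forward(self, image, clip_patch_tokens):
        # clip_patch_tokens: [B, N, clip_dim] intermediate tokens taken from a
        # *frozen* CLIP image encoder; the side network only reads them.
        B = image.shape[0]
        pix = self.patch_embed(image).flatten(2).transpose(1, 2)  # [B, N, side_dim]
        pix = pix + self.fuse(clip_patch_tokens)                  # assumes matching N
        x = torch.cat([self.queries.expand(B, -1, -1), pix], dim=1)
        for blk in self.blocks:
            x = blk(x)
        q, pix = x[:, : self.queries.shape[0]], x[:, self.queries.shape[0]:]
        # Branch 1: per-query mask logits over the N patch locations.
        masks = torch.einsum("bqc,bnc->bqn", self.mask_proj(q), pix)
        # Branch 2: per-query spatial attention bias (shared across heads here),
        # added to attention logits inside the frozen CLIP model so that extra
        # class tokens attend to each proposal's region when classifying it.
        attn_bias = torch.einsum("bqc,bnc->bqn", self.bias_proj(q), pix)
        return masks, attn_bias
```

In the full method the predicted attention bias is injected into the attention layers of the frozen CLIP encoder so that dedicated tokens summarize each masked region, and their embeddings are matched against CLIP text embeddings of the class names to label the mask proposals.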
Related papers
- Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
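The summary leaves the clustering procedure unspecified; as a generic illustration of online clustering over unlabeled masks, one batch-wise update with exponential moving averages could look like the sketch below. The feature pooling, prototype count, and update rule are all assumptions, not PixelCLIP's actual algorithm.

```python
import torch
import torch.nn.functional as F

def online_cluster_step(prototypes, mask_feats, momentum=0.99):
    """One online-clustering update over a batch of mask-pooled features.

    prototypes: [K, C] running cluster centers (pseudo-class prototypes)
    mask_feats: [M, C] features of unlabeled masks, e.g. CLIP features averaged
                inside each mask (how PixelCLIP pools them is an assumption here)
    Returns hard assignments and the updated prototypes.
    """
    protos = F.normalize(prototypes, dim=-1)
    feats = F.normalize(mask_feats, dim=-1)
    assign = (feats @ protos.t()).argmax(dim=-1)        # nearest prototype per mask
    new_protos = prototypes.clone()
    for k in assign.unique():
        centroid = feats[assign == k].mean(dim=0)
        # Exponential moving average keeps the clustering "online" across batches.
        new_protos[k] = momentum * prototypes[k] + (1 - momentum) * centroid
    return assign, new_protos
```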
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
- Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation.
Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction.
A new decoder is designed to interpret extracted semantic features for final prediction.
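A minimal sketch of such a single-stage pipeline, with a frozen CLIP backbone feeding a small trainable decoder, is shown below; the decoder architecture and the backbone output shape are placeholders rather than WeCLIP's actual design.

```python
import torch
import torch.nn as nn

class FrozenCLIPSegSketch(nn.Module):
    """Illustrative single-stage WSSS pipeline: frozen CLIP encoder + trainable decoder."""
    def __init__(self, clip_visual, clip_dim=768, num_classes=21):
        super().__init__()
        self.backbone = clip_visual
        for p in self.backbone.parameters():
            p.requires_grad_(False)          # CLIP is used purely as a feature extractor
        # A deliberately simple decoder; the paper's decoder design differs.
        self.decoder = nn.Sequential(
            nn.Conv2d(clip_dim, 256, 3, padding=1), nn.GELU(),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, image):
        with torch.no_grad():
            feats = self.backbone(image)     # assumed to return [B, clip_dim, H/16, W/16]
        logits = self.decoder(feats)
        return nn.functional.interpolate(logits, size=image.shape[-2:],
                                         mode="bilinear", align_corners=False)
```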
arXiv Detail & Related papers (2024-06-17T03:49:47Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks.
Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes.
We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z)
- Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network [26.97153244517095]
We propose a network that only needs a single pass through the visual-language model for each input image.
We first propose a novel network adaptation approach, termed patch severance, to restrict the harmful interference between the patch embeddings in the pre-trained visual encoder.
We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification.
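Patch severance is described here only as restricting interference between patch embeddings inside the pre-trained visual encoder; one way to picture such a restriction is an additive attention mask that cuts patch-to-patch attention while keeping self-attention and attention to the [CLS]-style tokens. The helper below is purely illustrative and is not claimed to match the paper's rule.

```python
import torch

def severed_attention_mask(num_patches, num_prefix_tokens=1):
    """Build an additive attention mask that stops patch tokens from attending
    to other patch tokens, while still letting every token attend to itself and
    to the prefix (e.g. [CLS]) tokens. The exact severance rule used by the
    paper is not given in the summary; this is one illustrative choice."""
    n = num_prefix_tokens + num_patches
    mask = torch.zeros(n, n)
    patch = slice(num_prefix_tokens, n)
    mask[patch, patch] = float("-inf")      # sever patch-to-patch attention
    idx = torch.arange(num_prefix_tokens, n)
    mask[idx, idx] = 0.0                    # but keep self-attention
    return mask  # add to attention logits before softmax in the chosen blocks
```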
arXiv Detail & Related papers (2023-04-03T17:59:21Z)
- Open-Vocabulary Universal Image Segmentation with MaskCLIP [24.74805434602145]
We tackle an emerging computer vision task, open-vocabulary universal image segmentation.
We first build a baseline method by directly adopting pre-trained CLIP models.
We then develop MaskCLIP, a Transformer-based approach built around a MaskCLIP Visual Encoder.
arXiv Detail & Related papers (2022-08-18T17:55:37Z)
- Exploiting Shape Cues for Weakly Supervised Semantic Segmentation [15.791415215216029]
Weakly supervised semantic segmentation (WSSS) aims to produce pixel-wise class predictions with only image-level labels for training.
We propose to exploit shape information to supplement the texture-biased property of convolutional neural networks (CNNs).
We further refine the predictions in an online fashion with a novel refinement method that takes into account both the class and the color affinities.
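As a rough illustration of affinity-based refinement, the function below propagates class scores between neighbouring pixels with similar colour; it covers only the colour-affinity part, and the kernel size, similarity measure, and iteration scheme are assumptions rather than the paper's refinement method.

```python
import torch
import torch.nn.functional as F

def color_affinity_refine(scores, image, iters=3, sigma=0.1, k=3):
    """Propagate class scores between neighbouring pixels with similar colour.

    scores: [B, C, H, W] per-class predictions (e.g. CAMs)
    image:  [B, 3, H, W] RGB in [0, 1]
    A generic local-affinity smoothing step, shown only to illustrate the idea
    of combining class and colour affinities mentioned above.
    """
    B, C, H, W = scores.shape
    pad = k // 2
    # Unfold k x k neighbourhoods of the image, compute colour similarity weights.
    img_nb = F.unfold(image, k, padding=pad).view(B, 3, k * k, H * W)
    aff = torch.exp(-((img_nb - image.view(B, 3, 1, H * W)) ** 2).sum(1) / (2 * sigma ** 2))
    aff = aff / aff.sum(dim=1, keepdim=True)             # [B, k*k, H*W] colour affinity
    for _ in range(iters):
        sc_nb = F.unfold(scores, k, padding=pad).view(B, C, k * k, H * W)
        scores = (sc_nb * aff.unsqueeze(1)).sum(2).view(B, C, H, W)
    return scores
```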
arXiv Detail & Related papers (2022-08-08T17:25:31Z)
- CRNet: Cross-Reference Networks for Few-Shot Segmentation [59.85183776573642]
Few-shot segmentation aims to learn a segmentation model that can be generalized to novel classes with only a few training images.
With a cross-reference mechanism, our network can better find the co-occurrent objects in the two images.
Experiments on the PASCAL VOC 2012 dataset show that our network achieves state-of-the-art performance.
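A common way to realise such a cross-reference idea is to gate each branch's features with channel statistics derived from both images, so that channels active in both (the co-occurrent objects) are emphasised; the block below is a simplified sketch under that assumption, not CRNet's exact module.

```python
import torch
import torch.nn as nn

class CrossReferenceSketch(nn.Module):
    """Illustrative cross-reference block: each branch is reinforced by the
    channel statistics of the other, emphasising features both images share."""
    def __init__(self, channels=256):
        super().__init__()
        self.gate_q = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.gate_s = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, feat_q, feat_s):
        # feat_q, feat_s: [B, C, H, W] query / support features
        g_q = self.gate_q(feat_q.mean(dim=(2, 3)))   # [B, C] global channel gates
        g_s = self.gate_s(feat_s.mean(dim=(2, 3)))
        # Cross-wise gating: the product of both gates amplifies channels that
        # are active in the query *and* the support image.
        co = (g_q * g_s)[:, :, None, None]
        return feat_q * co, feat_s * co
```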
arXiv Detail & Related papers (2020-03-24T04:55:43Z)