Visual and Textual Prior Guided Mask Assemble for Few-Shot Segmentation
and Beyond
- URL: http://arxiv.org/abs/2308.07539v1
- Date: Tue, 15 Aug 2023 02:46:49 GMT
- Title: Visual and Textual Prior Guided Mask Assemble for Few-Shot Segmentation
and Beyond
- Authors: Chen Shuai, Meng Fanman, Zhang Runtong, Qiu Heqian, Li Hongliang, Wu
Qingbo, Xu Linfeng
- Abstract summary: We propose a visual and textual Prior Guided Mask Assemble Network (PGMA-Net)
It employs a class-agnostic mask assembly process to alleviate the bias, and formulates diverse tasks in a unified manner by assembling the prior through affinity.
It achieves new state-of-the-art results in the FSS task, with mIoU of $77.6$ on $\text{PASCAL-}5^i$ and $59.4$ on $\text{COCO-}20^i$ in the 1-shot scenario.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot segmentation (FSS) aims to segment the novel classes with a few
annotated images. Due to CLIP's advantages of aligning visual and textual
information, integrating CLIP can enhance the generalization ability of an FSS
model. However, even with CLIP, existing CLIP-based FSS methods still suffer
from predictions biased towards base classes, caused by class-specific
feature-level interactions. To solve this
issue, we propose a visual and textual Prior Guided Mask Assemble Network
(PGMA-Net). It employs a class-agnostic mask assembly process to alleviate the
bias, and formulates diverse tasks in a unified manner by assembling the
prior through affinity. Specifically, the class-relevant textual and visual
features are first transformed to class-agnostic prior in the form of
probability map. Then, a Prior-Guided Mask Assemble Module (PGMAM) including
multiple General Assemble Units (GAUs) is introduced. It considers diverse and
plug-and-play interactions, such as visual-textual, inter- and intra-image,
training-free, and high-order ones. Lastly, to ensure the class-agnostic
ability, a Hierarchical Decoder with Channel-Drop Mechanism (HDCDM) is proposed
to flexibly exploit the assembled masks and low-level features, without relying
on any class-specific information. It achieves new state-of-the-art results in
the FSS task, with mIoU of $77.6$ on $\text{PASCAL-}5^i$ and $59.4$ on
$\text{COCO-}20^i$ in the 1-shot scenario. Beyond this, we show that, without extra
re-training, the proposed PGMA-Net can solve bbox-level and cross-domain FSS,
co-segmentation, and zero-shot segmentation (ZSS) tasks, leading to an any-shot
segmentation framework.
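
The abstract's two core ingredients, a class-agnostic prior expressed as a probability map and an affinity-based assembly of that prior onto the query image, can be sketched as follows. This is a minimal PyTorch illustration of the general pattern under assumed tensor shapes; the function names, the cosine-similarity prior, the temperature, and the single training-free assembly step are assumptions for illustration, not the authors' implementation of PGMAM/GAU or HDCDM.

```python
import torch
import torch.nn.functional as F

def class_agnostic_prior(img_feats, text_emb):
    """Turn class-relevant CLIP features into a class-agnostic probability map.
    img_feats: (B, C, H, W) image patch features; text_emb: (B, C) text embedding.
    Sketch only: cosine similarity rescaled from [-1, 1] to [0, 1]."""
    sim = F.cosine_similarity(img_feats, text_emb[:, :, None, None], dim=1)  # (B, H, W)
    return (sim + 1.0) / 2.0  # probability-map prior

def general_assemble_unit(query_feats, support_feats, support_prior, tau=0.1):
    """One GAU-style, training-free step: carry the support prior over to the
    query through pixel-wise affinity (parameter-free cross-attention).
    query_feats / support_feats: (B, C, H, W); support_prior: (B, H, W)."""
    B, C, H, W = query_feats.shape
    q = F.normalize(query_feats.flatten(2), dim=1).transpose(1, 2)  # (B, HW, C)
    s = F.normalize(support_feats.flatten(2), dim=1)                # (B, C, HW)
    affinity = torch.softmax(q @ s / tau, dim=-1)                   # (B, HW_q, HW_s)
    assembled = affinity @ support_prior.flatten(1).unsqueeze(-1)   # (B, HW_q, 1)
    return assembled.view(B, H, W)                                  # prior assembled on the query
```

In the paper, multiple such units cover visual-textual, inter- and intra-image, training-free, and high-order interactions, and the assembled masks are then fed, together with low-level features, to the HDCDM decoder; the temperature `tau` above is an arbitrary choice made for this sketch.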
Related papers
- Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend SAM to few-shot semantic segmentation (FSS).
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - Prompt-and-Transfer: Dynamic Class-aware Enhancement for Few-shot Segmentation [15.159690685421586]
This paper mimics the visual perception pattern of human beings and proposes a novel and powerful prompt-driven scheme, called "Prompt-and-Transfer" (PAT).
PAT constructs a dynamic class-aware prompting paradigm to tune the encoder to focus on the object of interest (target class) in the current task.
Surprisingly, PAT achieves competitive performance on 4 different tasks including standard FSS, Cross-domain FSS, Weak-label, and Zero-shot-label.
arXiv Detail & Related papers (2024-09-16T15:24:26Z) - Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation [67.35274834837064]
We develop a universal vision-language framework (UniFSS) to integrate prompts from text, mask, box, and image.
UniFSS significantly outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T08:41:01Z) - Label Anything: Multi-Class Few-Shot Semantic Segmentation with Visual Prompts [10.262029691744921]
We present Label Anything, an innovative neural network architecture designed for few-shot semantic segmentation (FSS)
Label Anything demonstrates remarkable generalizability across multiple classes with minimal examples required per class.
Our comprehensive experimental validation, particularly achieving state-of-the-art results on the COCO-$20^i$ benchmark, underscores Label Anything's robust generalization and flexibility.
arXiv Detail & Related papers (2024-07-02T09:08:06Z) - Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [72.47110803885235]
We introduce a novel framework named Cascade-CLIP for zero-shot semantic segmentation.
Our framework achieves superior zero-shot performance on segmentation benchmarks like COCO-Stuff, Pascal-VOC, and Pascal-Context.
arXiv Detail & Related papers (2024-06-02T08:32:51Z) - CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing [66.6712018832575]
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains.
We make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features.
arXiv Detail & Related papers (2024-03-21T11:58:50Z) - PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z) - Boosting Semantic Segmentation from the Perspective of Explicit Class
Embeddings [19.997929884477628]
We explore the mechanism of class embeddings and observe that more explicit and meaningful class embeddings can be purposely generated from class masks.
We propose ECENet, a new segmentation paradigm in which class embeddings are obtained and enhanced explicitly while interacting with multi-stage image features.
Our ECENet outperforms its counterparts on the ADE20K dataset with much less computational cost and achieves new state-of-the-art results on PASCAL-Context dataset.
arXiv Detail & Related papers (2023-08-24T16:16:10Z) - Side Adapter Network for Open-Vocabulary Semantic Segmentation [69.18441687386733]
This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN)
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
arXiv Detail & Related papers (2023-02-23T18:58:28Z) - CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly
Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels.
To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z) - A Joint Framework Towards Class-aware and Class-agnostic Alignment for
Few-shot Segmentation [11.47479526463185]
Few-shot segmentation aims to segment objects of unseen classes given only a few annotated support images.
Most existing methods simply stitch query features with independent support prototypes and segment the query image by feeding the mixed features to a decoder.
We propose a joint framework that combines more valuable class-aware and class-agnostic alignment guidance to facilitate the segmentation.
arXiv Detail & Related papers (2022-11-02T17:33:25Z)