Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation
- URL: http://arxiv.org/abs/2511.15118v1
- Date: Wed, 19 Nov 2025 04:41:43 GMT
- Title: Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation
- Authors: Jin Wang, Bingfeng Zhang, Jian Pang, Weifeng Liu, Baodi Liu, Honglong Chen
- Abstract summary: We propose an Unbiased Semantic Decoding (USD) strategy integrated with the Segment Anything Model (SAM). The USD strategy extracts target information from both the support and query sets simultaneously to perform consistent predictions. To generate target-focused prompt embeddings, a learnable visual-text target prompt generator is proposed.
- Score: 36.731980769369834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot segmentation has garnered significant attention. Many recent approaches attempt to introduce the Segment Anything Model (SAM) to handle this task. With the strong generalization ability and rich object-specific extraction ability of SAM, such a solution shows great potential in few-shot segmentation. However, the decoding process of SAM relies heavily on accurate and explicit prompts, so previous approaches mainly focus on extracting prompts from the support set. This is insufficient to activate the generalization ability of SAM, and it easily results in a biased decoding process when adapting to unknown classes. In this work, we propose an Unbiased Semantic Decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query sets simultaneously to perform consistent predictions guided by the semantics of the Contrastive Language-Image Pre-training (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage the semantic alignment capability of CLIP to enrich the original SAM features: a global supplement at the image level, which provides a generalized category indication from the support image, and a local guidance at the pixel level, which provides a useful target location from the query image. In addition, to generate target-focused prompt embeddings, a learnable visual-text target prompt generator is proposed that lets target text embeddings interact with CLIP visual features. Without requiring re-training of the vision foundation models, the semantically discriminative features draw attention to the target region through the guidance of prompts with rich target information.
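The two feature enhancement strategies named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how CLIP features are fused with SAM features, so the additive fusion in `global_supplement`, the multiplicative reweighting in `local_guidance`, the projection matrix `proj`, and all array shapes are assumptions made only for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_supplement(sam_feat, clip_image_emb, proj):
    """Image-level enhancement (assumed additive fusion).

    sam_feat:       (H, W, C) SAM feature map
    clip_image_emb: (D,) CLIP embedding of the support image
    proj:           (D, C) hypothetical projection into SAM's channel space
    """
    # The same projected category vector is broadcast to every location.
    return sam_feat + clip_image_emb @ proj

def local_guidance(sam_feat, clip_pixel_feat, text_emb):
    """Pixel-level enhancement (assumed multiplicative reweighting).

    clip_pixel_feat: (H, W, D) CLIP features of the query image
    text_emb:        (D,) CLIP text embedding of the target class
    """
    # Cosine similarity between each location and the target text.
    sim = l2_normalize(clip_pixel_feat) @ l2_normalize(text_emb)  # (H, W)
    # Rescale to [0, 1] so the map acts as a soft target-location prior.
    loc = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    return sam_feat * (1.0 + loc[..., None]), loc
```

In this sketch the global term carries category information that is identical at every location, while the local map highlights where in the query image the target is likely to be.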
Related papers
- Target-Oriented Single Domain Generalization [27.182037614828968]
Deep models trained on a single source domain often fail catastrophically under distribution shifts. We propose Target-Oriented Single Domain Generalization, a novel problem setup that leverages the textual description of the target domain. We introduce Spectral TARget Alignment (STAR), a module that injects target semantics into source features.
arXiv Detail & Related papers (2025-08-30T04:21:48Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning [61.666973416903005]
Segment Anything Model (SAM) has demonstrated its impressive generalization capabilities in open-world scenarios with the guidance of prompts.
We propose a novel framework, termed AlignSAM, designed for automatic prompting for aligning SAM to an open context.
arXiv Detail & Related papers (2024-06-01T16:21:39Z) - VRP-SAM: SAM with Visual Reference Prompt [76.71829864364283]
We propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM). VRP-SAM can utilize annotated reference images to comprehend specific objects and segment those objects in the target image. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy.
arXiv Detail & Related papers (2024-02-27T17:58:09Z) - Boosting Segment Anything Model Towards Open-Vocabulary Learning [69.24734826209367]
Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics. We present Sambor to seamlessly integrate SAM with the open-vocabulary object detector in an end-to-end framework.
arXiv Detail & Related papers (2023-12-06T17:19:00Z) - Self-guided Few-shot Semantic Segmentation for Remote Sensing Imagery Based on Large Vision Models [14.292149307183967]
This research introduces a structured framework designed for the automation of few-shot semantic segmentation.
It utilizes the SAM model and facilitates a more efficient generation of semantically discernible segmentation outcomes.
Central to our methodology is a novel automatic prompt learning approach, leveraging prior guided masks to produce coarse pixel-wise prompts for SAM.
arXiv Detail & Related papers (2023-11-22T07:07:55Z) - Few-Shot Classification & Segmentation Using Large Language Models Agent [0.7550566004119158]
We introduce a method that utilises large language models (LLM) as an agent to address the FS-CS problem in a training-free manner.
Our approach achieves state-of-the-art performance on the Pascal-5i dataset.
arXiv Detail & Related papers (2023-11-19T00:33:41Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Weakly-Supervised Semantic Segmentation via Sub-category Exploration [73.03956876752868]
We propose a simple yet effective approach to enforce the network to pay attention to other parts of an object.
Specifically, we perform clustering on image features to generate pseudo sub-categories labels within each annotated parent class.
We conduct extensive analysis to validate the proposed method and show that our approach performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2020-08-03T20:48:31Z)
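The clustering step described in the sub-category exploration entry above (clustering image features within each annotated parent class to obtain pseudo sub-category labels) can be sketched as follows. This is an illustration only: the choice of plain k-means, the function names, and the `parent * k + cluster` label encoding are assumptions, not details confirmed by the summary.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of x."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct samples (requires len(x) >= k).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every sample to every center: (n, k).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            if np.any(assign == j):  # leave empty clusters where they are
                centers[j] = x[assign == j].mean(0)
    return assign

def pseudo_subcategory_labels(features, parent_labels, k):
    """Cluster features separately inside each parent class.

    features:      (n, d) float array of image features
    parent_labels: (n,) integer array of annotated parent classes
    Returns (n,) integer labels encoded as parent * k + cluster index
    (a hypothetical encoding chosen for this sketch).
    """
    sub = np.empty(len(features), dtype=int)
    for c in np.unique(parent_labels):
        idx = np.where(parent_labels == c)[0]
        sub[idx] = c * k + kmeans(features[idx], k)
    return sub
```

Because clustering runs independently per parent class, a sub-category label never mixes images from different annotated classes.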