ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation
- URL: http://arxiv.org/abs/2507.02294v1
- Date: Thu, 03 Jul 2025 04:06:04 GMT
- Title: ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation
- Authors: Hanbo Bi, Yulong Xu, Ya Li, Yongqiang Mao, Boyuan Tong, Chongyang Li, Chunbo Lang, Wenhui Diao, Hongqi Wang, Yingchao Feng, Xian Sun
- Abstract summary: ViRefSAM is a novel framework that guides SAM utilizing only a few annotated reference images. It enables automatic segmentation of class-consistent objects across RS images and consistently outperforms existing few-shot segmentation methods across diverse datasets.
- Score: 21.953205396218767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Segment Anything Model (SAM), with its prompt-driven paradigm, exhibits strong generalization in generic segmentation tasks. However, applying SAM to remote sensing (RS) images still faces two major challenges. First, manually constructing precise prompts for each image (e.g., points or boxes) is labor-intensive and inefficient, especially in RS scenarios with dense small objects or spatially fragmented distributions. Second, SAM lacks domain adaptability, as it is pre-trained primarily on natural images and struggles to capture RS-specific semantics and spatial characteristics, especially when segmenting novel or unseen classes. To address these issues, inspired by few-shot learning, we propose ViRefSAM, a novel framework that guides SAM utilizing only a few annotated reference images that contain class-specific objects. Without requiring manual prompts, ViRefSAM enables automatic segmentation of class-consistent objects across RS images. Specifically, ViRefSAM introduces two key components while keeping SAM's original architecture intact: (1) a Visual Contextual Prompt Encoder that extracts class-specific semantic clues from reference images and generates object-aware prompts via contextual interaction with target images; and (2) a Dynamic Target Alignment Adapter, integrated into SAM's image encoder, which mitigates the domain gap by injecting class-specific semantics into target image features, enabling SAM to dynamically focus on task-relevant regions. Extensive experiments on three few-shot segmentation benchmarks, including iSAID-5$^i$, LoveDA-2$^i$, and COCO-20$^i$, demonstrate that ViRefSAM enables accurate and automatic segmentation of unseen classes by leveraging only a few reference images and consistently outperforms existing few-shot segmentation methods across diverse datasets.
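The abstract describes two plug-in components wrapped around a frozen SAM: a Visual Contextual Prompt Encoder that turns reference-image semantics into object-aware prompts, and a Dynamic Target Alignment Adapter that injects those class-specific semantics into the target features. The following is a minimal, illustrative PyTorch sketch of that data flow only; the module interfaces, the masked-average-pooling prototype, the attention layers, and all shapes are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): toy stand-ins mimicking the ViRefSAM
# data flow from the abstract. All names, shapes, and layers here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicTargetAlignmentAdapter(nn.Module):
    """Toy adapter: injects a reference-class prototype into target features via a residual bottleneck."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim * 2, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, target_feats: torch.Tensor, prototype: torch.Tensor) -> torch.Tensor:
        # target_feats: (B, N, C) patch tokens; prototype: (B, C) class vector from the references
        proto = prototype.unsqueeze(1).expand_as(target_feats)
        delta = self.up(F.gelu(self.down(torch.cat([target_feats, proto], dim=-1))))
        return target_feats + delta  # residual update keeps the frozen SAM features intact


class VisualContextualPromptEncoder(nn.Module):
    """Toy prompt encoder: cross-attends learnable prompt tokens over reference-conditioned target features."""
    def __init__(self, dim: int, num_prompts: int = 4):
        super().__init__()
        self.prompt_tokens = nn.Parameter(torch.randn(num_prompts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, target_feats: torch.Tensor, prototype: torch.Tensor) -> torch.Tensor:
        B = target_feats.shape[0]
        queries = self.prompt_tokens.unsqueeze(0).expand(B, -1, -1) + prototype.unsqueeze(1)
        prompts, _ = self.attn(queries, target_feats, target_feats)  # object-aware sparse prompts
        return prompts  # (B, num_prompts, C): would be fed to SAM's mask decoder as sparse embeddings


def masked_average_pool(ref_feats: torch.Tensor, ref_mask: torch.Tensor) -> torch.Tensor:
    # ref_feats: (B, C, H, W); ref_mask: (B, 1, H, W) binary mask of the class-specific object
    masked = (ref_feats * ref_mask).sum(dim=(2, 3))
    return masked / ref_mask.sum(dim=(2, 3)).clamp(min=1e-6)  # (B, C) class prototype


if __name__ == "__main__":
    B, C, H, W = 2, 256, 64, 64
    ref_feats = torch.randn(B, C, H, W)          # stand-in for frozen-SAM features of the reference image
    ref_mask = torch.randint(0, 2, (B, 1, H, W)).float()
    tgt_feats = torch.randn(B, H * W, C)         # stand-in for frozen-SAM patch tokens of the target image

    prototype = masked_average_pool(ref_feats, ref_mask)
    adapter = DynamicTargetAlignmentAdapter(C)
    prompt_enc = VisualContextualPromptEncoder(C)

    aligned = adapter(tgt_feats, prototype)      # class semantics injected into target features
    prompts = prompt_enc(aligned, prototype)     # object-aware prompts, no manual clicks or boxes
    print(prompts.shape)                         # torch.Size([2, 4, 256])
```

In the actual framework, the features would come from SAM's frozen image encoder and the generated prompts would replace manual points or boxes at the input of SAM's unmodified mask decoder; this sketch only illustrates how a reference mask can drive prompt generation without user interaction.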
Related papers
- SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation [4.4700130387278225]
Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes the latent semantic structure hidden in SAM2 explicit.
arXiv Detail & Related papers (2025-05-27T21:51:28Z) - DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency [91.30252180093333]
We propose the Dual Consistency SAM (DC-SAM) method based on prompt tuning to adapt SAM and SAM2 for in-context segmentation. Our key insight is to enhance the features of SAM's prompt encoder for segmentation by providing high-quality visual prompts. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2.
arXiv Detail & Related papers (2025-04-16T13:41:59Z) - Vision and Language Reference Prompt into SAM for Few-shot Segmentation [1.9458156037869137]
Segment Anything Model (SAM) is a large-scale segmentation model that enables powerful zero-shot capabilities with flexible prompts. Few-shot segmentation models address the need for manual prompting by inputting annotated reference images as prompts to SAM, segmenting specific objects in target images without user-provided prompts. We propose a novel few-shot segmentation model, Vision and Language reference Prompt into SAM, that utilizes the visual information of the reference images and the semantic information of the text labels.
arXiv Detail & Related papers (2025-02-02T08:40:14Z) - SAM-Aware Graph Prompt Reasoning Network for Cross-Domain Few-Shot Segmentation [25.00605325290872]
We propose a SAM-aware graph prompt reasoning network (GPRN) to guide cross-domain few-shot segmentation (CD-FSS) feature representation learning. GPRN transforms masks generated by SAM into visual prompts enriched with high-level semantic information. We show that our method establishes new state-of-the-art results.
arXiv Detail & Related papers (2024-12-31T06:38:49Z) - SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation [51.90445260276897]
We prove that the Segment Anything Model 2 (SAM2) can be a strong encoder for U-shaped segmentation models.
We propose a simple but effective framework, termed SAM2-UNet, for versatile image segmentation.
arXiv Detail & Related papers (2024-08-16T17:55:38Z) - SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation [88.80792308991867]
Segment Anything Model (SAM) has shown the ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges. This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation. Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains.
arXiv Detail & Related papers (2024-07-23T17:47:25Z) - FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM with a pipeline redesigned on two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
arXiv Detail & Related papers (2024-05-29T02:34:13Z) - Moving Object Segmentation: All You Need Is SAM (and Flow) [82.78026782967959]
We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects.
In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt.
These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks.
arXiv Detail & Related papers (2024-04-18T17:59:53Z) - Learning to Prompt Segment Anything Models [55.805816693815835]
Segment Anything Models (SAMs) have demonstrated great potential in learning to segment anything.
SAMs work with two types of prompts: spatial prompts (e.g., points) and semantic prompts (e.g., texts).
We propose spatial-semantic prompt learning (SSPrompt) that learns effective semantic and spatial prompts for better SAMs.
arXiv Detail & Related papers (2024-01-09T16:24:25Z) - Segment anything, from space? [8.126645790463266]
"Segment Anything Model" (SAM) can segment objects in input imagery based on cheap input prompts.
SAM usually achieved recognition accuracy similar to, or sometimes exceeding, vision models that had been trained on the target tasks.
We examine whether SAM's performance extends to overhead imagery problems and help guide the community's response to its development.
arXiv Detail & Related papers (2023-04-25T17:14:36Z)