Visual and Text Prompt Segmentation: A Novel Multi-Model Framework for Remote Sensing
- URL: http://arxiv.org/abs/2503.07911v1
- Date: Mon, 10 Mar 2025 23:15:57 GMT
- Title: Visual and Text Prompt Segmentation: A Novel Multi-Model Framework for Remote Sensing
- Authors: Xing Zi, Kairui Jin, Xian Tao, Jun Li, Ali Braytee, Rajiv Ratn Shah, Mukesh Prasad
- Abstract summary: We introduce the innovative VTPSeg pipeline, utilizing the strengths of Grounding DINO, CLIP, and SAM for enhanced open-vocabulary image segmentation. Our pipeline is validated by experimental and ablation study results on five popular remote sensing image segmentation datasets.
- Score: 30.980687857037033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pixel-level segmentation is essential in remote sensing, where foundational vision models like CLIP and the Segment Anything Model (SAM) have demonstrated significant capabilities in zero-shot segmentation tasks. Despite their advances, challenges specific to remote sensing remain substantial. First, without clear prompt constraints, SAM often generates redundant masks, complicating post-processing. Second, CLIP, designed mainly for global feature alignment, often overlooks the local objects crucial to remote sensing; this oversight leads to inaccurate recognition or misplaced focus in multi-target remote sensing imagery. Third, neither model has been pre-trained on multi-scale aerial views, increasing the likelihood of detection failures. To tackle these challenges, we introduce the innovative VTPSeg pipeline, utilizing the strengths of Grounding DINO, CLIP, and SAM for enhanced open-vocabulary image segmentation. The Grounding DINO+ (GD+) module generates initial candidate bounding boxes, while the CLIP Filter++ (CLIP++) module uses a combination of visual and textual prompts to refine and filter out irrelevant object bounding boxes, ensuring that only pertinent objects are considered. Subsequently, these refined bounding boxes serve as specific prompts for the FastSAM model, which executes precise segmentation. Our VTPSeg is validated by experimental and ablation study results on five popular remote sensing image segmentation datasets.
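The abstract's three-stage control flow (candidate boxes from GD+, prompt-based filtering by CLIP++, box-prompted masks from FastSAM) can be summarized in code. Below is a minimal sketch, assuming hypothetical model wrappers passed in as callables; it mirrors the described flow, not the authors' released implementation.

```python
# Minimal sketch of the VTPSeg control flow (boxes -> filter -> masks).
# The three model wrappers are hypothetical stand-ins for Grounding DINO+,
# CLIP Filter++, and FastSAM; plug real models in behind these callables.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float
    score: float


def vtpseg(
    image: np.ndarray,
    text_prompt: str,
    detect: Callable[[np.ndarray, str], List[Box]],                  # Grounding DINO+ stand-in
    clip_filter: Callable[[np.ndarray, List[Box], str], List[Box]],  # CLIP++ stand-in
    segment: Callable[[np.ndarray, List[Box]], List[np.ndarray]],    # FastSAM stand-in
) -> List[np.ndarray]:
    """Run the three-stage open-vocabulary segmentation flow."""
    candidates = detect(image, text_prompt)             # stage 1: candidate boxes
    kept = clip_filter(image, candidates, text_prompt)  # stage 2: drop irrelevant boxes
    return segment(image, kept)                         # stage 3: box-prompted masks
```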
Related papers
- Enabling Training-Free Text-Based Remote Sensing Segmentation [21.31811964222322]
Text-based remote sensing segmentation can be achieved without additional training, by relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM). Our contrastive approach employs CLIP as a mask selector for SAM's grid-based proposals, achieving state-of-the-art open-vocabulary semantic segmentation (OVSS) in a completely zero-shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM using GPT-5 in a zero-shot setting and a
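The contrastive branch described above (CLIP selecting among SAM's grid-based mask proposals) admits a compact sketch. The snippet below uses the OpenAI `clip` package; the masked bounding-box crop and the prompt template are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: score SAM's grid-based mask proposals with CLIP and keep the best
# match per class name. The cropping heuristic (masked bounding-box crop)
# and the prompt template are assumptions for illustration.
import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def select_masks(image: np.ndarray, masks: list[np.ndarray], class_names: list[str]):
    """Return the index of the best-scoring mask for each class name."""
    text = clip.tokenize([f"an aerial photo of a {c}" for c in class_names]).to(device)
    crops = []
    for m in masks:
        ys, xs = np.where(m)
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
        crop[~m[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0  # blank background
        crops.append(preprocess(Image.fromarray(crop)))
    with torch.no_grad():
        img_feat = model.encode_image(torch.stack(crops).to(device))
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        sims = img_feat @ txt_feat.T                    # (num_masks, num_classes)
    return sims.argmax(dim=0).tolist()                  # best mask per class
```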
arXiv Detail & Related papers (2026-02-19T20:05:56Z) - SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images [51.42466259821335]
We present a preliminary exploration of applying SAM 3 to the remote sensing OVSS task without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene.
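Both steps reduce to a few tensor operations. A minimal sketch follows, assuming per-category logits from the two heads and a presence score in [0, 1]; the averaging fusion rule and the 0.5 threshold are assumptions for illustration.

```python
# Sketch of the two ideas in the abstract: (1) fuse masks from two heads,
# (2) drop categories whose presence score is low. Averaging fusion and the
# 0.5 threshold are illustrative choices, not the paper's exact settings.
import torch


def fuse_and_filter(
    sem_logits: torch.Tensor,      # (C, H, W) semantic-head logits
    dec_logits: torch.Tensor,      # (C, H, W) Transformer-decoder logits
    presence: torch.Tensor,        # (C,) per-category presence scores in [0, 1]
    presence_thresh: float = 0.5,
) -> torch.Tensor:
    probs = 0.5 * (sem_logits.sigmoid() + dec_logits.sigmoid())  # simple fusion
    probs[presence < presence_thresh] = 0.0   # suppress absent categories
    return probs.argmax(dim=0)                # (H, W) label map
```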
arXiv Detail & Related papers (2025-12-09T15:42:28Z) - Evaluating SAM2 for Video Semantic Segmentation [60.157605818225186]
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos. This paper explores the extension of SAM2 to dense Video Semantic Segmentation (VSS). Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
arXiv Detail & Related papers (2025-12-01T15:15:16Z) - AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images [21.294581646546124]
AerOSeg is a novel Open-Vocabulary Segmentation (OVS) approach for remote sensing data.
We compute robust image-text correlation features using rotated versions of the input image and domain-specific prompts.
Inspired by the success of the Segment Anything Model (SAM) in diverse domains, we leverage SAM features to guide the spatial refinement of correlation features.
We enhance the refined correlation features using a multi-scale attention-aware composition to produce the final segmentation map.
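The rotation-robust correlation step can be sketched as follows: compute a patch-text cosine-similarity map for each 90-degree rotation of the input, rotate the maps back, and average. The dense patch encoder below is a hypothetical stand-in, and the simple averaging is an assumption.

```python
# Sketch of rotation-averaged image-text correlation: compute a patch-text
# similarity map per 90-degree rotation, undo the rotation, and average.
import torch
import torch.nn.functional as F


def rotation_averaged_correlation(image, text_feat, patch_encoder):
    """image: (3, H, W); text_feat: (C, D); returns a (C, h, w) similarity map."""
    maps = []
    for k in range(4):                                   # 0, 90, 180, 270 degrees
        rotated = torch.rot90(image, k, dims=(1, 2))
        patches = patch_encoder(rotated)                 # (D, h, w) dense features
        patches = F.normalize(patches, dim=0)
        text = F.normalize(text_feat, dim=1)
        sim = torch.einsum("cd,dhw->chw", text, patches)
        maps.append(torch.rot90(sim, -k, dims=(1, 2)))   # undo the rotation
    return torch.stack(maps).mean(dim=0)
```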
arXiv Detail & Related papers (2025-04-12T13:06:46Z) - ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models [10.858627659431928]
Service robots must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learning-based segmentation techniques require extensive annotated datasets. This paper proposes a novel approach (ZISVFM) for solving unseen object instance segmentation (UOIS) by leveraging the powerful zero-shot capability of the Segment Anything Model (SAM) and explicit visual representations from a self-supervised vision transformer (ViT).
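One common way such a combination is realized (an assumption here, not necessarily ZISVFM's exact procedure) is to rank SAM's class-agnostic mask proposals by the objectness implied by a self-supervised ViT's CLS-to-patch attention map:

```python
# Illustrative sketch: rank SAM's class-agnostic masks by a DINO-style
# CLS-attention objectness score. A common heuristic, assumed here.
import numpy as np


def rank_masks_by_attention(masks: list[np.ndarray], cls_attn: np.ndarray):
    """masks: boolean (H, W) arrays; cls_attn: (H, W) CLS-to-patch attention,
    upsampled to image resolution. Returns mask indices, most object-like first."""
    scores = [float(cls_attn[m].mean()) if m.any() else 0.0 for m in masks]
    return sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
```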
arXiv Detail & Related papers (2025-02-05T15:22:20Z) - Pattern Integration and Enhancement Vision Transformer for Self-Supervised Learning in Remote Sensing [11.626527403157922]
We present the Pattern Integration and Enhancement Vision Transformer (PIEViT), a novel self-supervised learning framework for remote sensing imagery.
PIEViT enhances the representation of internal patch features, providing significant improvements over existing self-supervised baselines.
It achieves excellent results in object detection, land cover classification, and change detection, underscoring its robustness, generalization, and transferability for remote sensing image interpretation tasks.
arXiv Detail & Related papers (2024-11-09T07:06:31Z) - Boosting 3D Object Detection with Semantic-Aware Multi-Branch Framework [44.44329455757931]
In autonomous driving, LiDAR sensors are vital for acquiring 3D point clouds, providing reliable geometric information. Traditional sampling methods used in preprocessing often ignore semantic features, leading to detail loss and ground-point interference. We propose a multi-branch two-stage 3D object detection framework using a Semantic-aware Multi-branch Sampling (SMS) module and multi-view constraints.
arXiv Detail & Related papers (2024-07-08T09:25:45Z) - FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM with a pipeline redesigned on two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
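A pixel-wise dynamic ReLU in this spirit predicts per-pixel piecewise-linear coefficients from a guidance feature (e.g., click-derived maps). The sketch below is a minimal two-piece variant; channel sizes and the coefficient head are assumptions, not FocSAM's exact design.

```python
# Sketch of a pixel-wise dynamic ReLU: per-pixel piecewise-linear
# coefficients are predicted from a guidance feature map.
import torch
import torch.nn as nn


class PixelwiseDyReLU(nn.Module):
    def __init__(self, channels: int, guide_channels: int):
        super().__init__()
        # Predict (a1, b1, a2, b2) per pixel from the guidance features.
        self.coef = nn.Conv2d(guide_channels, 4 * channels, kernel_size=1)

    def forward(self, x: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        a1, b1, a2, b2 = self.coef(guide).chunk(4, dim=1)
        # Two-piece dynamic ReLU: y = max(a1*x + b1, a2*x + b2), per pixel.
        return torch.maximum(a1 * x + b1, a2 * x + b2)
```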
arXiv Detail & Related papers (2024-05-29T02:34:13Z) - RSAM-Seg: A SAM-based Approach with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation [10.37240769959699]
Segment Anything Model (SAM) provides a universal pre-training model for image segmentation tasks.
We propose RSAM-Seg, which stands for Remote Sensing SAM with Semantic, as a tailored modification of SAM for the remote sensing field.
Adapter-Scale, a set of supplementary scaling modules, is proposed for the multi-head attention blocks in the encoder part of SAM.
Experiments are conducted on four distinct remote sensing scenarios, encompassing cloud detection, field monitoring, building detection and road mapping tasks.
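A bottleneck adapter of the kind Adapter-Scale suggests can be sketched as a down-project / nonlinearity / up-project residual branch added to a frozen SAM encoder block; the dimensions, scaling factor, and placement below are assumptions, not the paper's exact module.

```python
# Sketch of a bottleneck adapter with a residual scaling factor, of the
# kind used to adapt frozen transformer encoders. Sizes are assumptions.
import torch
import torch.nn as nn


class AdapterScale(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64, scale: float = 0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale  # keeps the frozen backbone's behavior dominant

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.scale * self.up(self.act(self.down(x)))
```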
arXiv Detail & Related papers (2024-02-29T09:55:46Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
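PCA-based localization of this flavor typically projects dense patch features onto their first principal component and thresholds the resulting heatmap. A minimal sketch, with a sign-orientation heuristic and a mean threshold assumed for simplicity:

```python
# Sketch of PCA-based localization: threshold the projection of patch
# features onto their first principal component.
import numpy as np


def pca_object_map(features: np.ndarray) -> np.ndarray:
    """features: (h, w, d) patch features. Returns a boolean (h, w) object map."""
    h, w, d = features.shape
    flat = features.reshape(-1, d)
    flat = flat - flat.mean(axis=0)
    # First principal component via SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = (flat @ vt[0]).reshape(h, w)
    if proj.mean() < 0:               # orient so the object side is positive
        proj = -proj
    return proj > proj.mean()         # threshold into an object mask
```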
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
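A lightweight cross-modal projection of the kind described might map the text embedding of the referring expression into sparse and dense prompt embeddings for SAM's mask decoder. The sketch below assumes typical dimensions (768-d text features, 256-d SAM prompts); it is illustrative, not RefSAM's released module.

```python
# Sketch: project a referring expression's text embedding into sparse and
# dense prompt embeddings for a SAM-style mask decoder. Sizes are assumptions.
import torch
import torch.nn as nn


class CrossModalMLP(nn.Module):
    def __init__(self, text_dim: int = 768, prompt_dim: int = 256, n_sparse: int = 4):
        super().__init__()
        self.sparse = nn.Sequential(
            nn.Linear(text_dim, prompt_dim), nn.GELU(),
            nn.Linear(prompt_dim, n_sparse * prompt_dim),
        )
        self.dense = nn.Sequential(
            nn.Linear(text_dim, prompt_dim), nn.GELU(),
            nn.Linear(prompt_dim, prompt_dim),
        )
        self.n_sparse, self.prompt_dim = n_sparse, prompt_dim

    def forward(self, text_emb: torch.Tensor):
        """text_emb: (B, text_dim) -> sparse (B, n_sparse, D) and dense (B, D)."""
        sparse = self.sparse(text_emb).view(-1, self.n_sparse, self.prompt_dim)
        return sparse, self.dense(text_emb)
```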
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
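For context, the range view representation the paper builds on is the standard spherical projection of a LiDAR sweep, where several points can collide in one pixel (the "many-to-one" mapping the abstract mentions). A minimal sketch, with a typical 64 x 2048 resolution and vertical field-of-view values assumed:

```python
# Sketch of the standard spherical ("range view") projection: each LiDAR
# point maps to a pixel by azimuth and elevation; colliding points keep
# the nearest range, illustrating the "many-to-one" mapping.
import numpy as np


def range_projection(points: np.ndarray, H: int = 64, W: int = 2048,
                     fov_up: float = 3.0, fov_down: float = -25.0):
    """points: (N, 3) xyz. Returns an (H, W) range image; unfilled pixels are -1."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(y, x)                          # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.clip(r, 1e-8, None))   # elevation
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = (0.5 * (1.0 - yaw / np.pi) * W).astype(int) % W
    v = ((fov_up_r - pitch) / (fov_up_r - fov_down_r) * H).clip(0, H - 1).astype(int)
    img = np.full((H, W), -1.0)
    order = np.argsort(-r)     # write nearest points last so they win collisions
    img[v[order], u[order]] = r[order]
    return img
```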
arXiv Detail & Related papers (2023-03-09T16:13:27Z) - Discovery-and-Selection: Towards Optimal Multiple Instance Learning for Weakly Supervised Object Detection [86.86602297364826]
We propose a discovery-and-selection approach fused with multiple instance learning (DS-MIL).
Our proposed DS-MIL approach can consistently improve the baselines, reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-10-18T07:06:57Z) - Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose the Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on the YouTube-VIS and BDD100K datasets.
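Prototypical cross-attention of this flavor condenses a large space-time memory into a few prototypes and attends over those instead of all memory entries. The sketch below uses k-means for the condensation step, which is an assumption; PCAN's actual aggregation scheme may differ.

```python
# Sketch: condense a memory of features into k prototypes (k-means here),
# then run standard scaled dot-product attention against the prototypes.
import torch
import torch.nn.functional as F


def prototypical_cross_attention(query, memory, n_prototypes=8, iters=10):
    """query: (Nq, D); memory: (Nm, D). Returns (Nq, D) attended features."""
    # k-means on memory features to obtain prototypes.
    protos = memory[torch.randperm(memory.shape[0])[:n_prototypes]].clone()
    for _ in range(iters):
        assign = torch.cdist(memory, protos).argmin(dim=1)      # (Nm,)
        for k in range(n_prototypes):
            if (assign == k).any():
                protos[k] = memory[assign == k].mean(dim=0)
    # Attention with prototypes as keys and values.
    attn = F.softmax(query @ protos.T / protos.shape[1] ** 0.5, dim=-1)
    return attn @ protos
```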
arXiv Detail & Related papers (2021-06-22T17:57:24Z)