RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models
- URL: http://arxiv.org/abs/2501.06809v1
- Date: Sun, 12 Jan 2025 13:22:35 GMT
- Title: RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models
- Authors: Keyan Chen, Jiafan Zhang, Chenyang Liu, Zhengxia Zou, Zhenwei Shi
- Abstract summary: Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding.
We introduce a referring remote sensing image segmentation foundational model, RSRefSeg.
Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods.
- Abstract: Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding through free-format textual input, enabling enhanced scene and object extraction in remote sensing applications. Current research primarily utilizes pre-trained language models to encode textual descriptions and align them with visual modalities, thereby facilitating the expression of relevant visual features. However, these approaches often struggle to establish robust alignments between fine-grained semantic concepts, leading to inconsistent representations across textual and visual information. To address these limitations, we introduce a referring remote sensing image segmentation foundational model, RSRefSeg. RSRefSeg leverages CLIP for visual and textual encoding, employing both global and local textual semantics as filters to generate referring-related visual activation features in the latent space. These activated features then serve as input prompts for SAM, which refines the segmentation masks through its robust visual generalization capabilities. Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods, underscoring the effectiveness of foundational models in enhancing multimodal task comprehension. The code is available at \url{https://github.com/KyanChen/RSRefSeg}.
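The pipeline the abstract describes — text embeddings acting as filters over visual features to produce activation maps that become prompts for SAM — can be sketched roughly as follows. This is a toy NumPy illustration under assumed shapes, not the authors' implementation; all function names are hypothetical, and random features stand in for the CLIP encoders and the SAM decoder is omitted entirely:

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def referring_activation(visual_feats, text_global, text_local):
    """Score each spatial location against the combined text semantics.

    visual_feats: (H, W, D) per-location features from a CLIP-like encoder.
    text_global, text_local: (D,) sentence-level and word-level embeddings.
    Returns an (H, W) cosine-similarity activation map.
    """
    v = l2norm(visual_feats)
    t = l2norm(text_global + text_local, axis=0)
    return v @ t

def topk_point_prompts(act, k=3):
    """Pick the k highest-activation locations as sparse point prompts
    that a SAM-like mask decoder could consume."""
    H, W = act.shape
    flat = np.argsort(act.ravel())[-k:]
    return np.stack([flat // W, flat % W], axis=1)  # (k, 2) as (row, col)

# Toy run with random features standing in for CLIP outputs.
rng = np.random.default_rng(0)
H, W, D = 8, 8, 16
feats = rng.normal(size=(H, W, D))
t_global, t_local = rng.normal(size=D), rng.normal(size=D)
act = referring_activation(feats, t_global, t_local)
prompts = topk_point_prompts(act, k=3)
```

In the actual model the activated features are passed to SAM in latent space rather than as discrete points; the sketch only shows the filter-then-prompt control flow.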
Related papers
- Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression.
We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges.
Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
arXiv Detail & Related papers (2025-01-01T14:24:04Z)
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.13782704236074]
We propose a new referring remote sensing image segmentation method to fully exploit the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) would simultaneously leverage the features of the input image and the corresponding texts.
We evaluate the effectiveness of the proposed method on two public referring remote sensing datasets including RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning [21.907749083387042]
This paper introduces a prior instruction representation (PIR) learning paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations.
Comprehensive experiments demonstrate that PIR could enhance vision and text representations and outperform the state-of-the-art methods of closed-domain and open-domain retrieval.
arXiv Detail & Related papers (2024-05-16T14:53:45Z)
- Resolving References in Visually-Grounded Dialogue via Text Generation [3.8673630752805446]
Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge.
We propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references.
We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot.
arXiv Detail & Related papers (2023-09-23T17:07:54Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred to by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- RRSIS: Referring Remote Sensing Image Segmentation [25.538406069768662]
Localizing desired objects from remote sensing images is of great use in practical applications.
Referring image segmentation, which aims at segmenting out the objects to which a given expression refers, has been extensively studied in natural images.
We introduce referring remote sensing image segmentation (RRSIS) to fill in this gap and make some insightful explorations.
arXiv Detail & Related papers (2023-06-14T16:40:19Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
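The text-to-pixel alignment idea behind CRIS can be illustrated with a toy sketch: each pixel's feature is compared against the text embedding with a temperature-scaled cosine similarity, and referred pixels are pulled toward the text while background pixels are pushed away. This is a minimal NumPy illustration of the general contrastive objective, not the CRIS implementation; the names, shapes, and temperature value are assumptions:

```python
import numpy as np

def text_to_pixel_logits(pixel_feats, text_emb, tau=0.07):
    """Temperature-scaled cosine similarity between each pixel feature
    and the text embedding.

    pixel_feats: (N, D) flattened per-pixel features; text_emb: (D,).
    """
    z = pixel_feats / (np.linalg.norm(pixel_feats, axis=1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return (z @ t) / tau

def text_to_pixel_loss(logits, labels):
    """Binary cross-entropy over pixels: referred pixels (label 1) are
    aligned with the text, background pixels (label 0) are pushed away."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-8
    return float(-np.mean(labels * np.log(p + eps)
                          + (1 - labels) * np.log(1 - p + eps)))

# Toy run: labels that agree with the logits yield a small loss.
rng = np.random.default_rng(1)
feats = rng.normal(size=(6, 4))   # 6 "pixels", 4-dim features
text = rng.normal(size=4)
logits = text_to_pixel_logits(feats, text)
labels = (logits > 0).astype(float)
loss = text_to_pixel_loss(logits, labels)
```

Flipping the labels increases the loss, which is the gradient signal that drives pixel features toward or away from the text embedding during training.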
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred to by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Locate-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.