RRSIS: Referring Remote Sensing Image Segmentation
- URL: http://arxiv.org/abs/2306.08625v2
- Date: Fri, 1 Mar 2024 21:10:52 GMT
- Title: RRSIS: Referring Remote Sensing Image Segmentation
- Authors: Zhenghang Yuan, Lichao Mou, Yuansheng Hua, Xiao Xiang Zhu
- Abstract summary: Localizing desired objects from remote sensing images is of great use in practical applications.
Referring image segmentation, which aims at segmenting out the objects to which a given expression refers, has been extensively studied in natural images.
We introduce referring remote sensing image segmentation (RRSIS) to extend this task to remote sensing imagery, where it has received almost no research attention, and make some insightful explorations.
- Score: 25.538406069768662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Localizing desired objects from remote sensing images is of great use in
practical applications. Referring image segmentation, which aims at segmenting
out the objects to which a given expression refers, has been extensively
studied in natural images. However, almost no research attention is given to
this task for remote sensing imagery. Considering its potential for real-world
applications, in this paper, we introduce referring remote sensing image
segmentation (RRSIS) to fill in this gap and make some insightful explorations.
Specifically, we create a new dataset, called RefSegRS, for this task, enabling
us to evaluate different methods. Afterward, we benchmark referring image
segmentation methods of natural images on the RefSegRS dataset and find that
these models show limited efficacy in detecting small and scattered objects. To
alleviate this issue, we propose a language-guided cross-scale enhancement
(LGCE) module that utilizes linguistic features to adaptively enhance
multi-scale visual features by integrating both deep and shallow features. The
proposed dataset, benchmarking results, and the designed LGCE module provide
insights into the design of a better RRSIS model. We will make our dataset and
code publicly available.
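The abstract describes LGCE only at a high level: linguistic features adaptively enhance multi-scale visual features by combining deep and shallow features. The following is a minimal sketch of that general idea, not the authors' implementation. The function names, the additive fusion, the channel-wise sigmoid gate, and the projection matrix `w_gate` are all assumptions made for illustration.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def language_guided_fusion(shallow, deep, text, w_gate):
    """Hypothetical sketch: fuse deep and shallow features, then gate by text.

    shallow: (C, H, W) high-resolution (shallow) features
    deep:    (C, H // f, W // f) low-resolution (deep) features
    text:    (D,) sentence embedding
    w_gate:  (C, D) projection from text space to channel gates (assumed learned)
    """
    factor = shallow.shape[1] // deep.shape[1]
    # Bring the deep features to the shallow resolution and combine them.
    fused = shallow + upsample_nearest(deep, factor)
    # One gate per channel, computed from the text embedding, in (0, 1).
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ text)))
    # Residual enhancement: channels the text favours are amplified.
    return fused + gate[:, None, None] * fused


rng = np.random.default_rng(0)
shallow = rng.normal(size=(8, 16, 16))
deep = rng.normal(size=(8, 8, 8))
text = rng.normal(size=(32,))
w_gate = rng.normal(size=(8, 32)) * 0.1
out = language_guided_fusion(shallow, deep, text, w_gate)
print(out.shape)
```

The output keeps the shallow resolution, which is where small and scattered objects are most likely to remain visible. A real module would learn `w_gate` jointly with the visual and language encoders.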
Related papers
- Pattern Integration and Enhancement Vision Transformer for Self-Supervised Learning in Remote Sensing [11.626527403157922]
We present the Pattern Integration and Enhancement Vision Transformer (PIEViT), a novel self-supervised learning framework for remote sensing imagery.
PIEViT enhances the representation of internal patch features, providing significant improvements over existing self-supervised baselines.
It achieves excellent results in object detection, land cover classification, and change detection, underscoring its robustness, generalization, and transferability for remote sensing image interpretation tasks.
arXiv Detail & Related papers (2024-11-09T07:06:31Z) - Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding text.
We evaluate the effectiveness of the proposed method on two public referring remote sensing datasets, RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred to by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Adaptive Rotated Convolution for Rotated Object Detection [96.94590550217718]
We present the Adaptive Rotated Convolution (ARC) module to handle the rotated object detection problem.
In our ARC module, the convolution kernels rotate adaptively to extract object features with varying orientations in different images.
The proposed approach achieves state-of-the-art performance on the DOTA dataset with 81.77% mAP.
arXiv Detail & Related papers (2023-03-14T11:53:12Z) - RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data [14.742224345061487]
We introduce the task of visual grounding for remote sensing data (RSVG).
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language.
In this work, we construct a large-scale benchmark dataset of RSVG and explore deep learning models for the RSVG task.
arXiv Detail & Related papers (2022-10-23T07:08:22Z) - Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information [15.32353270625554]
Cross-modal remote sensing text-image retrieval (RSCTIR) has recently become a research hotspot because it enables fast and flexible information extraction from remote sensing (RS) images.
We propose a novel RSCTIR framework based on global and local information (GaLR), and design a multi-level information dynamic fusion (MIDF) module to effectively integrate features of different levels.
Experiments on public datasets demonstrate the state-of-the-art performance of the GaLR method on the RSCTIR task.
arXiv Detail & Related papers (2022-04-21T03:18:09Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z) - Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred to by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Locate-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z) - An End-to-end Framework For Low-Resolution Remote Sensing Semantic Segmentation [0.5076419064097732]
We propose an end-to-end framework that unites a super-resolution and a semantic segmentation module.
It allows the semantic segmentation network to conduct the reconstruction process, modifying the input image with helpful textures.
The results show that the framework is capable of achieving a semantic segmentation performance close to native high-resolution data.
arXiv Detail & Related papers (2020-03-17T21:41:22Z)