RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data
- URL: http://arxiv.org/abs/2210.12634v1
- Date: Sun, 23 Oct 2022 07:08:22 GMT
- Title: RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data
- Authors: Yang Zhan, Zhitong Xiong and Yuan Yuan
- Abstract summary: We introduce the task of visual grounding for remote sensing data (RSVG)
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language.
In this work, we construct a large-scale benchmark dataset of RSVG and explore deep learning models for the RSVG task.
- Score: 14.742224345061487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce the task of visual grounding for remote sensing
data (RSVG). RSVG aims to localize the referred objects in remote sensing (RS)
images with the guidance of natural language. To retrieve rich information from
RS imagery using natural language, many research tasks, such as RS image visual question answering, RS image captioning, and RS image-text retrieval, have been extensively investigated. However, object-level visual grounding on RS images is still under-explored. Thus, in this work, we propose to construct a dataset
and explore deep learning models for the RSVG task. Specifically, our
contributions can be summarized as follows. 1) We build a new large-scale benchmark dataset for RSVG, termed RSVGD, to advance research on RSVG.
This new dataset includes image/expression/box triplets for training and
evaluating visual grounding models. 2) We benchmark extensive state-of-the-art (SOTA) natural-image visual grounding methods on the constructed RSVGD dataset and provide insightful analyses of the results. 3) We propose a novel transformer-based Multi-Level Cross-Modal feature learning (MLCM) module. Remote sensing images usually exhibit large scale variations and cluttered backgrounds. To deal with the scale-variation problem, the MLCM
module takes advantage of multi-scale visual features and multi-granularity
textual embeddings to learn more discriminative representations. To cope with
the cluttered background problem, MLCM adaptively filters irrelevant noise and
enhances salient features. In this way, our proposed model can incorporate more
effective multi-level and multi-modal features to boost performance.
Furthermore, this work also provides useful insights for developing better RSVG
models. The dataset and code will be publicly available at
https://github.com/ZhanYang-nwpu/RSVG-pytorch.
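The abstract's description of MLCM suggests a straightforward structure: multi-scale visual tokens cross-attend to multi-granularity textual embeddings so that language guidance suppresses cluttered-background responses. The snippet below is a minimal, hypothetical PyTorch sketch of that idea; the class, parameter names, and shapes are illustrative assumptions and are not taken from the authors' released code at the repository above.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Minimal sketch of multi-level cross-modal fusion: each visual scale
    cross-attends to the textual embeddings (a hypothetical rendering of the
    MLCM idea, not the authors' implementation)."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_levels: int = 3):
        super().__init__()
        # One cross-attention block per visual feature level.
        self.blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_levels)]
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats, text_feats):
        # visual_feats: list of (B, N_l, C) token maps, one per scale.
        # text_feats:   (B, T, C) word/phrase embeddings.
        fused = []
        for feats, attn in zip(visual_feats, self.blocks):
            # Visual tokens query the language tokens; the attention weights
            # act as an adaptive filter over cluttered-background responses.
            ctx, _ = attn(query=feats, key=text_feats, value=text_feats)
            fused.append(self.norm(feats + ctx))
        # Concatenate all levels for a downstream localization head
        # (e.g. a box-regression decoder).
        return torch.cat(fused, dim=1)


if __name__ == "__main__":
    vis = [torch.randn(2, n, 256) for n in (1024, 256, 64)]  # three scales
    txt = torch.randn(2, 20, 256)                            # 20 text tokens
    print(CrossModalFusion()(vis, txt).shape)  # torch.Size([2, 1344, 256])
```

In a full model the fused representation would presumably feed a localization head that regresses the referred bounding box; the sketch covers only the fusion step.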
Related papers
- Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG [24.342190878813234]
ImageRAG for RS is a training-free framework to address the complexities of analyzing UHR remote sensing imagery.
ImageRAG's core innovation lies in its ability to selectively retrieve and focus on the most relevant portions of the UHR image as visual contexts.
arXiv Detail & Related papers (2024-11-12T10:12:12Z)
- RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models [3.178739428363249]
We propose a workflow to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform.
Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions.
arXiv Detail & Related papers (2024-08-27T02:45:26Z)
- EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension [12.9701635989222]
The first visual prompting model named EarthMarker is proposed, which excels in image-level, region-level, and point-level RS imagery interpretation.
To endow EarthMarker with versatile multi-granularity visual perception abilities, a cross-domain phased learning strategy is developed.
To tackle the lack of RS visual prompting data, a dataset named RSVP featuring multi-modal fine-grained visual prompting instruction is constructed.
arXiv Detail & Related papers (2024-07-18T15:35:00Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models, using labeled triplets of a reference image, text, and target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large VHR remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
arXiv Detail & Related papers (2024-04-03T12:06:01Z)
- GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
We propose GeoChat - the first versatile remote sensing Large Vision-Language Model (VLM) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can not only answer image-level queries but also accept region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
arXiv Detail & Related papers (2023-11-24T18:59:10Z)
- RRSIS: Referring Remote Sensing Image Segmentation [25.538406069768662]
Localizing desired objects from remote sensing images is of great use in practical applications.
Referring image segmentation, which aims at segmenting out the objects to which a given expression refers, has been extensively studied in natural images.
We introduce referring remote sensing image segmentation (RRSIS) to fill in this gap and make some insightful explorations.
arXiv Detail & Related papers (2023-06-14T16:40:19Z)
- Adaptive Rotated Convolution for Rotated Object Detection [96.94590550217718]
We present the Adaptive Rotated Convolution (ARC) module to handle the rotated object detection problem.
In our ARC module, the convolution kernels rotate adaptively to extract object features with varying orientations in different images.
The proposed approach achieves state-of-the-art performance on the DOTA dataset with 81.77% mAP.
arXiv Detail & Related papers (2023-03-14T11:53:12Z)
- Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information [15.32353270625554]
Cross-modal remote sensing text-image retrieval (RSCTIR) has recently become an urgent research hotspot due to its ability to enable fast and flexible information extraction from remote sensing (RS) images.
We first propose a novel RSCTIR framework based on global and local information (GaLR), and design a multi-level information dynamic fusion (MIDF) module to efficaciously integrate features of different levels.
Experiments on public datasets strongly demonstrate the state-of-the-art performance of GaLR methods on the RSCTIR task.
arXiv Detail & Related papers (2022-04-21T03:18:09Z)
- RGB-D Saliency Detection via Cascaded Mutual Information Minimization [122.8879596830581]
Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning.
We introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data.
arXiv Detail & Related papers (2021-09-15T12:31:27Z)
- Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images [69.5662419067878]
Grounding referring expressions in RGBD images is an emerging field.
We present a novel task of 3D visual grounding in single-view RGBD image where the referred objects are often only partially scanned due to occlusion.
Our approach first fuses the language and the visual features at the bottom level to generate a heatmap that localizes the relevant regions in the RGBD image.
Then our approach conducts an adaptive feature learning based on the heatmap and performs the object-level matching with another visio-linguistic fusion to finally ground the referred object.
arXiv Detail & Related papers (2021-03-14T11:18:50Z)