Related papers: Text-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering

URL: http://arxiv.org/abs/2411.15770v1
Date: Sun, 24 Nov 2024 09:48:03 GMT
Title: Text-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering
Authors: Zhicheng Zhao, Changfu Zhou, Yu Zhang, Chenglong Li, Xiaoliang Ma, Jin Tang
Abstract summary: Current Remote Sensing Visual Question Answering (RSVQA) methods are limited by the imaging mechanisms of optical sensors. We propose a Text-guided Coarse-to-Fine Fusion Network (TGFNet) to improve RSVQA performance. We create the first large-scale benchmark dataset for evaluating optical-SAR RSVQA methods.
Score: 26.8129265632403
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Remote Sensing Visual Question Answering (RSVQA) has gained significant research interest. However, current RSVQA methods are limited by the imaging mechanisms of optical sensors, particularly under challenging conditions such as cloud-covered and low-light scenarios. Given the all-time and all-weather imaging capabilities of Synthetic Aperture Radar (SAR), it is crucial to investigate the integration of optical-SAR images to improve RSVQA performance. In this work, we propose a Text-guided Coarse-to-Fine Fusion Network (TGFNet), which leverages the semantic relationships between question text and multi-source images to guide the network toward complementary fusion at the feature level. Specifically, we develop a Text-guided Coarse-to-Fine Attention Refinement (CFAR) module to focus on key areas related to the question in complex remote sensing images. This module progressively directs attention from broad areas to finer details through key region routing, enhancing the model's ability to focus on relevant regions. Furthermore, we propose an Adaptive Multi-Expert Fusion (AMEF) module that dynamically integrates different experts, enabling the adaptive fusion of optical and SAR features. In addition, we create the first large-scale benchmark dataset for evaluating optical-SAR RSVQA methods, comprising 6,008 well-aligned optical-SAR image pairs and 1,036,694 well-labeled question-answer pairs across 16 diverse question types, including complex relational reasoning questions. Extensive experiments on the proposed dataset demonstrate that our TGFNet effectively integrates complementary information between optical and SAR images, significantly improving the model's performance in challenging scenarios. The dataset is available at: https://github.com/mmic-lcl/. Index Terms: Remote Sensing Visual Question Answering, Multi-source Data Fusion, Multimodal, Remote Sensing, OPT-SAR.

Related papers

LSFDNet: A Single-Stage Fusion and Detection Network for Ships Using SWIR and LWIR [16.16208006025223]
Short-wave infrared (SWIR) and long-wave infrared (LWIR) are used in ship detection. We propose a novel single-stage image fusion detection algorithm called LSFDNet. This algorithm leverages feature interaction between the image fusion and object detection subtask networks. We validated the superiority of our proposed single-stage fusion detection algorithm on two datasets.
arXiv Detail & Related papers (2025-07-28T07:13:55Z)
AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [58.67129770371016]
We propose a novel IRSTD framework that reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy.
arXiv Detail & Related papers (2025-05-21T07:02:05Z)
M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection [28.405249208866067]
Single-source remote sensing object detection using optical or SAR images struggles in complex environments. We propose the first comprehensive dataset for optical-SAR fusion object detection, named Multi-resolution, Multi-polarization, Multi-scene, Multi-source SAR dataset (M4-SAR). To enable standardized evaluation, we develop a unified benchmarking toolkit that integrates six state-of-the-art multi-source fusion methods.
arXiv Detail & Related papers (2025-05-16T07:10:07Z)
Cloud Removal With PolSAR-Optical Data Fusion Using A Two-Flow Residual Network [9.529237717137121]
Reconstructing cloud-free optical images has become a major task in recent years. This paper presents a two-flow Polarimetric Synthetic Aperture Radar (PolSAR)-Optical data fusion cloud removal algorithm.
arXiv Detail & Related papers (2025-01-14T07:35:14Z)
DAF-Net: A Dual-Branch Feature Decomposition Fusion Network with Domain Adaptive for Infrared and Visible Image Fusion [21.64382683858586]
Infrared and visible image fusion aims to combine complementary information from both modalities to provide a more comprehensive scene understanding. We propose a dual-branch feature decomposition fusion network (DAF-Net) with domain adaptation. By incorporating multi-kernel maximum mean discrepancy (MK-MMD), the DAF-Net effectively aligns the latent feature spaces of visible and infrared images, thereby improving the quality of the fused images.
arXiv Detail & Related papers (2024-09-18T02:14:08Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
An Interactively Reinforced Paradigm for Joint Infrared-Visible Image Fusion and Saliency Object Detection [59.02821429555375]
This research focuses on the discovery and localization of hidden objects in the wild and serves unmanned systems. Through empirical analysis, infrared and visible image fusion (IVIF) makes hard-to-find objects apparent. Multimodal salient object detection (SOD) accurately delineates the precise spatial location of objects within the picture.
arXiv Detail & Related papers (2023-05-17T06:48:35Z)
A lightweight multi-scale context network for salient object detection in optical remote sensing images [16.933770557853077]
We propose a multi-scale context network, namely MSCNet, for salient object detection in optical RSIs. Specifically, a multi-scale context extraction module is adopted to address the scale variation of salient objects. In order to accurately detect complete salient objects in complex backgrounds, we design an attention-based pyramid feature aggregation mechanism.
arXiv Detail & Related papers (2022-05-18T14:32:47Z)
Multi-Content Complementation Network for Salient Object Detection in Optical Remote Sensing Images [108.79667788962425]
Salient object detection in optical remote sensing images (RSI-SOD) remains a challenging emerging topic. We propose a novel Multi-Content Complementation Network (MCCNet) to explore the complementarity of multiple content for RSI-SOD. In MCCM, we consider multiple types of features that are critical to RSI-SOD, including foreground features, edge features, background features, and global image-level features.
arXiv Detail & Related papers (2021-12-02T04:46:40Z)
RRNet: Relational Reasoning Network with Parallel Multi-scale Attention for Salient Object Detection in Optical Remote Sensing Images [82.1679766706423]
Salient object detection (SOD) for optical remote sensing images (RSIs) aims at locating and extracting visually distinctive objects/regions from the optical RSIs. We propose a relational reasoning network with parallel multi-scale attention for SOD in optical RSIs. Our proposed RRNet outperforms the existing state-of-the-art SOD competitors both qualitatively and quantitatively.
arXiv Detail & Related papers (2021-10-27T07:18:32Z)
The QXS-SAROPT Dataset for Deep Learning in SAR-Optical Data Fusion [14.45289690639374]
We publish the QXS-SAROPT dataset to foster deep learning research in SAR-optical data fusion. We show exemplary results for two representative applications, namely SAR-optical image matching and SAR ship detection boosted by cross-modal information from optical images.
arXiv Detail & Related papers (2021-03-15T10:22:46Z)
Deep Burst Super-Resolution [165.90445859851448]
We propose a novel architecture for the burst super-resolution task. Our network takes multiple noisy RAW images as input, and generates a denoised, super-resolved RGB image as output. In order to enable training and evaluation on real-world data, we additionally introduce the BurstSR dataset.
arXiv Detail & Related papers (2021-01-26T18:57:21Z)
Dense Attention Fluid Network for Salient Object Detection in Optical Remote Sensing Images [193.77450545067967]
We propose an end-to-end Dense Attention Fluid Network (DAFNet) for salient object detection in optical remote sensing images (RSIs). A Global Context-aware Attention (GCA) module is proposed to adaptively capture long-range semantic context relationships. We construct a new and challenging optical RSI dataset for SOD that contains 2,000 images with pixel-wise saliency annotations.
arXiv Detail & Related papers (2020-11-26T06:14:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.