Related papers: A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning

URL: http://arxiv.org/abs/2411.01445v1
Date: Sun, 03 Nov 2024 06:03:39 GMT
Title: A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning
Authors: Fei Wang, Chengcheng Chen, Hongyu Chen, Yugang Chang, Weiming Zeng
Abstract summary: Current visual question answering (VQA) tasks often require constructing multimodal datasets and fine-tuning visual language models. This letter proposes a novel VQA approach that integrates object detection networks with visual language models. This integration aims to enhance the capabilities of VQA systems, focusing on aspects such as ship location, density, and size analysis.
Score: 10.748210940033484
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current visual question answering (VQA) tasks often require constructing multimodal datasets and fine-tuning visual language models, which demands significant time and resources. This has greatly hindered the application of VQA to downstream tasks, such as ship information analysis based on Synthetic Aperture Radar (SAR) imagery. To address this challenge, this letter proposes a novel VQA approach that integrates object detection networks with visual language models, specifically designed for analyzing ships in SAR images. This integration aims to enhance the capabilities of VQA systems, focusing on aspects such as ship location, density, and size analysis, as well as risk behavior detection. Initially, we conducted baseline experiments using YOLO networks on two representative SAR ship detection datasets, SSDD and HRSID, to assess each model's performance in terms of detection accuracy. Based on these results, we selected the optimal model, YOLOv8n, as the most suitable detection network for this task. Subsequently, leveraging the vision-language model Qwen2-VL, we designed and implemented a VQA task specifically for SAR scenes. This task employs the ship location and size information output by the detection network to generate multi-turn dialogues and scene descriptions for SAR imagery. Experimental results indicate that this method not only enables fundamental SAR scene question-answering without the need for additional datasets or fine-tuning but also dynamically adapts to complex, multi-turn dialogue requirements, demonstrating robust semantic understanding and adaptability.
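As a rough illustration of the idea in the abstract (feeding the detector's ship location and size output to a vision-language model as text), the following minimal sketch turns bounding boxes into a prompt. It is not the authors' implementation: the `Detection` structure, the three-by-three region grid, the size thresholds, and the prompt wording are all assumptions made for this example, and the calls to YOLOv8n and Qwen2-VL are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One detected ship as a pixel-space box (hypothetical structure)."""
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float


def region(det: Detection, width: float, height: float) -> str:
    """Name the cell of an assumed 3x3 grid that contains the box center."""
    cx = (det.x1 + det.x2) / 2
    cy = (det.y1 + det.y2) / 2
    col = "left" if cx < width / 3 else "right" if cx > 2 * width / 3 else "center"
    row = "top" if cy < height / 3 else "bottom" if cy > 2 * height / 3 else "middle"
    return f"{row}-{col}"


def size_label(det: Detection, width: float, height: float,
               small: float = 0.01, large: float = 0.05) -> str:
    """Label a box by its share of the image area (thresholds are assumptions)."""
    frac = ((det.x2 - det.x1) * (det.y2 - det.y1)) / (width * height)
    if frac < small:
        return "small"
    if frac > large:
        return "large"
    return "medium"


def build_prompt(dets: List[Detection], width: int, height: int, question: str) -> str:
    """Summarize the detections as text and append the user's question."""
    lines = [f"The detector found {len(dets)} ship(s) in a {width}x{height} SAR image."]
    for i, det in enumerate(dets, start=1):
        lines.append(
            f"Ship {i}: {region(det, width, height)} region, "
            f"{size_label(det, width, height)} size, "
            f"confidence {det.confidence:.2f}."
        )
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    ships = [Detection(10, 10, 30, 20, 0.9), Detection(100, 100, 200, 200, 0.8)]
    print(build_prompt(ships, 300, 300, "Where are the ships concentrated?"))
```

In a setup like the one described, the resulting string would be sent to the vision-language model together with the image, and follow-up questions in a multi-turn dialogue could reuse the same detection summary.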

Related papers

Visual Question Answering on Multiple Remote Sensing Image Modalities [1.6932802756478726]
In many fields such as remote sensing, the visual feature extraction step could benefit significantly from leveraging different image modalities. We introduce a new VQA dataset, named TAMMI, with diverse questions on scenes described by three different modalities. We also propose the MM-RSVQA model, based on VisualBERT, a vision-language transformer, to effectively combine the multiple image modalities and text.
arXiv Detail & Related papers (2025-05-21T11:42:47Z)
SAR Object Detection with Self-Supervised Pretraining and Curriculum-Aware Sampling [41.24071764578782]
Object detection in satellite-borne Synthetic Aperture Radar imagery holds immense potential in tasks such as urban monitoring and disaster response. The detection of small objects in satellite-borne SAR images poses a particularly intricate problem, because of the technology's relatively low spatial resolution and inherent noise. In this paper, we introduce TRANSAR, a novel self-supervised end-to-end vision transformer-based SAR object detection model.
arXiv Detail & Related papers (2025-04-17T19:44:05Z)
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing [33.19843463374473]
Vision-Language Models (VLMs) in remote sensing have demonstrated significant potential in traditional tasks. Current models, which excel in Referring Expression Comprehension (REC), struggle with tasks involving complex instructions. We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which includes Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT). We propose a novel unified data representation using a set-of-points approach for RSVLTS, along with a condition and a self-augmentation strategy based on cyclic referring.
arXiv Detail & Related papers (2025-03-16T12:48:17Z)
SpecDM: Hyperspectral Dataset Synthesis with Pixel-level Semantic Annotations [27.391859339238906]
In this paper, we explore the potential of generative diffusion model in synthesizing hyperspectral images with pixel-level annotations. To the best of our knowledge, it is the first work to generate high-dimensional HSIs with annotations. We select two of the most widely used dense prediction tasks: semantic segmentation and change detection, and generate datasets suitable for these tasks.
arXiv Detail & Related papers (2025-02-24T11:13:37Z)
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation [12.32553804641971]
Vision language models (VLMs) have made remarkable progress in natural language processing and image understanding. This paper innovatively proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M.
arXiv Detail & Related papers (2025-02-12T07:19:36Z)
SAR Strikes Back: A New Hope for RSVQA [1.6249398255272318]
We present a dataset that allows for the introduction of SAR images in the RSVQA framework. SAR images capture electromagnetic information from the scene, and are less affected by atmospheric conditions, such as clouds. We show that SAR data offers additional information when fused with the optical modality, particularly for questions related to specific land cover classes, such as water areas.
arXiv Detail & Related papers (2025-01-14T14:07:48Z)
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z)
PGNeXt: High-Resolution Salient Object Detection via Pyramid Grafting Network [24.54269823691119]
We present an advanced study on more challenging high-resolution salient object detection (HRSOD) from both dataset and network framework perspectives. To compensate for the lack of HRSOD datasets, we thoughtfully collect a large-scale high-resolution salient object detection dataset, called UHRSD. All the images are finely annotated at the pixel level, far exceeding previous low-resolution SOD datasets.
arXiv Detail & Related papers (2024-08-02T09:31:21Z)
RS-DFM: A Remote Sensing Distributed Foundation Model for Diverse Downstream Tasks [11.681342476516267]
We propose a Remote Distributed Sensing Foundation Model (RS-DFM) based on generalized information mapping and interaction. This model can realize online collaborative perception across multiple platforms and various downstream tasks. We present a dual-branch information compression module to decouple high-frequency and low-frequency feature information.
arXiv Detail & Related papers (2024-06-11T07:46:47Z)
DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem. To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects. In our experiments, we observe that our model outperforms the state-of-the-art on Argoverse 2 Sensor and Open dataset.
arXiv Detail & Related papers (2024-06-06T18:12:04Z)
SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection [79.23689506129733]
We establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created.
arXiv Detail & Related papers (2024-03-11T09:20:40Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
MVSA-Net: Multi-View State-Action Recognition for Robust and Deployable Trajectory Generation [6.032808648673282]
The learn-from-observation (LfO) paradigm is a human-inspired mode for a robot to learn to perform a task simply by watching it being performed. We present multi-view SA-Net, which generalizes the SA-Net model to allow the perception of multiple viewpoints of the task activity.
arXiv Detail & Related papers (2023-11-14T18:53:28Z)
ADASR: An Adversarial Auto-Augmentation Framework for Hyperspectral and Multispectral Data Fusion [54.668445421149364]
Deep learning-based hyperspectral image (HSI) super-resolution aims to generate high spatial resolution HSI (HR-HSI) by fusing hyperspectral image (HSI) and multispectral image (MSI) with deep neural networks (DNNs). In this letter, we propose a novel adversarial automatic data augmentation framework ADASR that automatically optimizes and augments HSI-MSI sample pairs to enrich data diversity for HSI-MSI fusion.
arXiv Detail & Related papers (2023-10-11T07:30:37Z)
From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data [27.160303686163164]
Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems. No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representation. There are questions with clearly different difficulty levels for each image in the RSVQA task. A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features.
arXiv Detail & Related papers (2022-05-06T11:37:00Z)
SAR-ShipNet: SAR-Ship Detection Neural Network via Bidirectional Coordinate Attention and Multi-resolution Feature Fusion [7.323279438948967]
This paper studies a practically meaningful ship detection problem from synthetic aperture radar (SAR) images using neural networks. We propose a SAR-ship detection neural network (called SAR-ShipNet for short), by newly developing Bidirectional Coordinate Attention (BCA) and Multi-resolution Feature Fusion (MRF) based on CenterNet. Experimental results on the public SAR-Ship dataset show that our SAR-ShipNet achieves competitive advantages in both speed and accuracy.
arXiv Detail & Related papers (2022-03-29T12:27:04Z)
Context-Preserving Instance-Level Augmentation and Deformable Convolution Networks for SAR Ship Detection [50.53262868498824]
Shape deformation of targets in SAR images due to random orientation and partial information loss is an essential challenge in SAR ship detection. We propose a data augmentation method to train a deep network that is robust to partial information loss within the targets.
arXiv Detail & Related papers (2022-02-14T07:01:01Z)
Cross-Attention in Coupled Unmixing Nets for Unsupervised Hyperspectral Super-Resolution [79.97180849505294]
We propose a novel coupled unmixing network with a cross-attention mechanism, CUCaNet, to enhance the spatial resolution of HSI. Experiments are conducted on three widely-used HS-MS datasets in comparison with state-of-the-art HSI-SR models.
arXiv Detail & Related papers (2020-07-10T08:08:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.