SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding
- URL: http://arxiv.org/abs/2504.03254v1
- Date: Fri, 04 Apr 2025 08:09:53 GMT
- Title: SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding
- Authors: Yimin Wei, Aoran Xiao, Yexian Ren, Yuting Zhu, Hongruixuan Chen, Junshi Xia, Naoto Yokoya
- Abstract summary: Vision-Language Models (VLMs) have demonstrated remarkable success in RGB image understanding, offering powerful open-vocabulary interpretation and flexible language interaction. We introduce SARLANG-1M, a large-scale benchmark tailored for multimodal SAR image understanding, with a primary focus on integrating SAR with textual modality. It features hierarchical resolutions (ranging from 0.1 to 25 meters), fine-grained semantic descriptions (including both concise and detailed captions), diverse remote sensing categories, and multi-task question-answering pairs spanning seven applications and 1,012 question types.
- Score: 20.314150537672198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic Aperture Radar (SAR) is a crucial remote sensing technology, enabling all-weather, day-and-night observation with strong surface penetration for precise and continuous environmental monitoring and analysis. However, SAR image interpretation remains challenging due to its complex physical imaging mechanisms and significant visual disparities from human perception. Recently, Vision-Language Models (VLMs) have demonstrated remarkable success in RGB image understanding, offering powerful open-vocabulary interpretation and flexible language interaction. However, their application to SAR images is severely constrained by the absence of SAR-specific knowledge in their training distributions, leading to suboptimal performance. To address this limitation, we introduce SARLANG-1M, a large-scale benchmark tailored for multimodal SAR image understanding, with a primary focus on integrating SAR with textual modality. SARLANG-1M comprises more than 1 million high-quality SAR image-text pairs collected from over 59 cities worldwide. It features hierarchical resolutions (ranging from 0.1 to 25 meters), fine-grained semantic descriptions (including both concise and detailed captions), diverse remote sensing categories (1,696 object types and 16 land cover classes), and multi-task question-answering pairs spanning seven applications and 1,012 question types. Extensive experiments on mainstream VLMs demonstrate that fine-tuning with SARLANG-1M significantly enhances their performance in SAR image interpretation, reaching performance comparable to human experts. The dataset and code will be made publicly available at https://github.com/Jimmyxichen/SARLANG-1M.
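For readers planning to fine-tune a VLM on this benchmark, the minimal sketch below shows how SAR image-text/VQA records of this kind are typically wrapped for training. It is hypothetical: the JSON-lines layout and field names (image_path, resolution_m, caption, qa_pairs) are placeholders chosen for illustration, not the published SARLANG-1M schema; the linked repository documents the actual format.

```python
# Hypothetical sketch of a dataset wrapper for SAR image-text/VQA records.
# Field names and file layout are assumptions, not the released SARLANG-1M schema.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class SarVqaDataset(Dataset):
    """Wraps a JSON-lines annotation file of SAR image-text records for VQA fine-tuning."""

    def __init__(self, annotation_file: str, image_root: str):
        self.image_root = Path(image_root)
        with open(annotation_file) as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        rec = self.records[idx]
        # SAR imagery is usually single-channel; most VLMs expect 3-channel input.
        image = Image.open(self.image_root / rec["image_path"]).convert("RGB")
        # Each record could carry a concise or detailed caption plus several QA pairs
        # drawn from the benchmark's question types; here we take the first pair.
        qa = rec["qa_pairs"][0]
        return {
            "image": image,
            "resolution_m": rec.get("resolution_m"),  # 0.1-25 m per the abstract
            "caption": rec.get("caption", ""),
            "question": qa["question"],
            "answer": qa["answer"],
        }
```

In practice, such a wrapper would be combined with the chosen VLM's own image processor and prompt template during fine-tuning.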
Related papers
- SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and Progressive Transfer Learning [15.611051083630862]
We construct a large-scale and high-quality SAR image-text dataset consisting of over 130,000 SAR image-text pairs. To verify the effectiveness of the SAR-Text dataset, we conduct experiments on three typical vision-language tasks. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 16.43%. In the captioning task, SAR-RS-CoCa achieves BLEU-4, SPICE, and CIDEr scores exceeding those of the original CoCa model by more than 8x, 4x, and 10x, respectively.
arXiv Detail & Related papers (2025-07-24T18:45:30Z)
- MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution [76.30559905769859]
MegaSR mines customized block-wise semantics and expressive guidance for diffusion-based ISR.
We experimentally identify HED edge maps, depth maps, and segmentation maps as the most expressive guidance.
Extensive experiments demonstrate the superiority of MegaSR in terms of semantic richness and structural consistency.
arXiv Detail & Related papers (2025-03-11T07:00:20Z)
- GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis [17.83602731408318]
We introduce GAIA, a novel dataset for multi-scale, multi-sensor, and multi-modal Remote Sensing (RS) image analysis.
GAIA comprises 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions.
GAIA significantly improves performance on RS image classification, cross-modal retrieval and image captioning tasks.
arXiv Detail & Related papers (2025-02-13T18:52:14Z)
- SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation [12.32553804641971]
Vision language models (VLMs) have made remarkable progress in natural language processing and image understanding. This paper proposes SARChat-2M, the first large-scale multimodal dialogue dataset for SAR images.
arXiv Detail & Related papers (2025-02-12T07:19:36Z)
- Enhancing SAR Object Detection with Self-Supervised Pre-training on Masked Auto-Encoders [5.234109158596138]
Self-supervised learning (SSL) is proposed to learn feature representations of SAR images during the pre-training process.
The proposed method captures proper latent representations of SAR images and improves the model generalization in downstream tasks.
arXiv Detail & Related papers (2025-01-20T03:28:34Z)
- Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers [58.14748181398049]
Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. We propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs.
arXiv Detail & Related papers (2024-12-21T09:30:45Z)
- PBVS 2024 Solution: Self-Supervised Learning and Sampling Strategies for SAR Classification in Extreme Long-Tail Distribution [5.965417506363093]
We propose a two-stage learning approach that utilizes self-supervised techniques, combined with multimodal learning and inference.
Our model achieved an accuracy of 21.45%, an AUC of 0.56, and a total score of 0.30, placing us 9th in the competition.
arXiv Detail & Related papers (2024-12-17T05:49:16Z)
- Electrooptical Image Synthesis from SAR Imagery Using Generative Adversarial Networks [0.0]
The results show significant improvements in interpretability, making SAR data more accessible for analysts familiar with EO imagery.
Our research contributes to the field of remote sensing by bridging the gap between SAR and EO imagery, offering a novel tool for enhanced data interpretation.
arXiv Detail & Related papers (2024-09-07T14:31:46Z)
- HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model [88.13261547704444]
HyperSIGMA is a vision transformer-based foundation model that unifies HSI interpretation across tasks and scenes. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training, which contains about 450K hyperspectral images.
arXiv Detail & Related papers (2024-06-17T13:22:58Z)
- RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large VHR remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
arXiv Detail & Related papers (2024-04-03T12:06:01Z)
- SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection [79.23689506129733]
We establish a new benchmark dataset and an open-source method for large-scale SAR object detection.
Our dataset, SARDet-100K, is the result of intensive surveying, collection, and standardization of 10 existing SAR detection datasets.
To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created.
arXiv Detail & Related papers (2024-03-11T09:20:40Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Planting a SEED of Vision in Large Language Model [73.17530130368053]
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the ability to SEE and Draw at the same time.
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
arXiv Detail & Related papers (2023-07-16T13:41:39Z)
- DDet: Dual-path Dynamic Enhancement Network for Real-World Image Super-Resolution [69.2432352477966]
Real image super-resolution (Real-SR) focuses on the relationship between real-world high-resolution (HR) and low-resolution (LR) images.
In this article, we propose a Dual-path Dynamic Enhancement Network(DDet) for Real-SR.
Unlike conventional methods, which stack up massive convolutional blocks for feature representation, we introduce a content-aware framework to study non-inherently aligned image pairs.
arXiv Detail & Related papers (2020-02-25T18:24:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.