SARCLIP: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery
- URL: http://arxiv.org/abs/2510.22665v1
- Date: Sun, 26 Oct 2025 13:04:50 GMT
- Title: SARCLIP: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery
- Authors: Qiwei Ma, Zhiyu Wang, Wang Liu, Xukun Lu, Bin Deng, Puhong Duan, Xudong Kang, Shutao Li
- Abstract summary: We introduce SARCLIP, the first vision language foundation model tailored for the SAR domain. SARCLIP is trained with a contrastive vision-language learning approach and a domain transfer strategy. Experiments on image-text retrieval and zero-shot classification tasks demonstrate the superior performance of SARCLIP.
- Score: 46.87845911116779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic Aperture Radar (SAR) has emerged as a crucial imaging modality due to its all-weather capabilities. While recent advancements in self-supervised learning and Masked Image Modeling (MIM) have paved the way for SAR foundation models, these approaches primarily focus on low-level visual features, often overlooking multimodal alignment and zero-shot target recognition within SAR imagery. To address this limitation, we construct SARCLIP-1M, a large-scale vision language dataset comprising over one million text-image pairs aggregated from existing datasets. We further introduce SARCLIP, the first vision language foundation model tailored for the SAR domain. Our SARCLIP model is trained with a contrastive vision-language learning approach and a domain transfer strategy, enabling it to bridge the gap between SAR imagery and textual descriptions. Extensive experiments on image-text retrieval and zero-shot classification tasks demonstrate the superior performance of SARCLIP in feature extraction and interpretation, significantly outperforming state-of-the-art foundation models and advancing the semantic understanding of SAR imagery. The code and datasets will be released soon.
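Since the code has not yet been released, the following is only a minimal sketch of the CLIP-style symmetric contrastive objective and prompt-based zero-shot classification that the abstract describes. All names and shapes are illustrative assumptions, not SARCLIP's actual implementation.

```python
# Sketch of CLIP-style contrastive training and zero-shot classification.
# Encoder outputs and the learnable temperature are assumed; this is not
# SARCLIP's released code.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits for a batch of N aligned pairs: (N, N).
    logits = logit_scale.exp() * image_emb @ text_emb.t()
    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

@torch.no_grad()
def zero_shot_classify(image_emb, class_text_emb):
    # Score SAR image embeddings against text embeddings built from prompts
    # such as "a SAR image of a {class}"; the highest cosine similarity wins.
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(class_text_emb, dim=-1).t()
    return sims.argmax(dim=-1)
```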
Related papers
- FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery [8.62554606349568]
FUSAR-GPT is a VLM designed specifically for Synthetic Aperture Radar (SAR) applications. It embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors'. It achieves state-of-the-art performance across several typical remote-sensing vision-language benchmarks.
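The abstract does not define the 'spatiotemporal anchors' precisely; one plausible reading, sketched here purely under that assumption, is that acquisition metadata (e.g., location and time) is encoded and added to the backbone's patch tokens.

```python
# Hedged sketch of one possible "spatiotemporal anchor" mechanism: project
# acquisition metadata to the token width and add it to ViT patch tokens.
# This interpretation is an assumption, not FUSAR-GPT's documented design.
import torch
import torch.nn as nn

class SpatiotemporalAnchor(nn.Module):
    def __init__(self, meta_dim=4, embed_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(meta_dim, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, embed_dim))

    def forward(self, patch_tokens, meta):
        # patch_tokens: (B, N, D); meta: (B, meta_dim), e.g. lat, lon,
        # day-of-year, orbit direction. One anchor broadcast over all tokens.
        anchor = self.proj(meta).unsqueeze(1)  # (B, 1, D)
        return patch_tokens + anchor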
arXiv Detail & Related papers (2026-02-22T13:40:17Z)
- SARMAE: Masked Autoencoder for SAR Representation Learning [17.36199520462285]
We propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. SARMAE injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks.
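The abstract does not say exactly where the speckle enters the pipeline; a minimal sketch, assuming multiplicative gamma-distributed speckle (the standard L-look intensity model) applied to patch intensities:

```python
# Sketch of speckle injection for noise-aware MAE pretraining. Gamma(L, L)
# has mean 1, so multiplication adds speckle while preserving average
# brightness. Where SARMAE applies this is an assumption.
import torch

def inject_speckle(patches, looks=4.0):
    # patches: (B, N, D) non-negative SAR intensity values.
    gamma = torch.distributions.Gamma(float(looks), float(looks))
    noise = gamma.sample(patches.shape).to(patches.device)
    return patches * noise
```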
arXiv Detail & Related papers (2025-12-18T15:10:19Z)
- SAR-KnowLIP: Towards Multimodal Foundation Models for Remote Sensing [13.878173189132085]
Cross-modal artificial intelligence has garnered widespread attention in recent years, achieving significant progress in the study of natural images. Existing methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery. This paper proposes SAR-KnowLIP, the first universal SAR multimodal foundation model, along with reusable data and evaluation baselines.
arXiv Detail & Related papers (2025-09-28T15:03:25Z)
- Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images [51.74614065919118]
This paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of RS images. We propose SimFeatUp, a universal upsampler that robustly restores high-resolution spatial details from coarse features. We also present a simple yet effective Global Bias Alleviation operation to subtract the inherent global context from patch features.
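A minimal sketch of the Global Bias Alleviation idea as stated in the abstract: subtract a global-context estimate from dense patch features. Using the mean patch feature as the global bias, and the scaling factor, are assumptions rather than the paper's exact formulation.

```python
# Sketch of Global Bias Alleviation: remove a shared global component from
# patch features so local semantics dominate. The mean-feature estimate and
# alpha weighting are illustrative assumptions.
import torch

def global_bias_alleviation(patch_feats, alpha=0.3):
    # patch_feats: (B, N, D) dense features from a frozen VLM backbone.
    global_ctx = patch_feats.mean(dim=1, keepdim=True)  # (B, 1, D)
    return patch_feats - alpha * global_ctx
```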
arXiv Detail & Related papers (2025-08-25T14:22:57Z)
- SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding [20.314150537672198]
Vision-Language Models (VLMs) have demonstrated remarkable success in RGB image understanding, offering powerful open-vocabulary interpretation and flexible language interaction. We introduce SARLANG-1M, a large-scale benchmark tailored for multimodal SAR image understanding, with a primary focus on integrating SAR with the textual modality. It features hierarchical resolutions (ranging from 0.1 to 25 meters), fine-grained semantic descriptions (including both concise and detailed captions), diverse remote sensing categories, and multi-task question-answering pairs spanning seven applications and 1,012 question types.
arXiv Detail & Related papers (2025-04-04T08:09:53Z)
- Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm, employing a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. We propose a Data-efficient Generalization (DeG) framework including two novel designs, namely a Textual Supplement (TS) module and a Semantic-Set (S-Set).
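A minimal sketch of the pseudo-word mapping described above: a small network maps an image embedding into the text-token embedding space so it can fill a placeholder slot such as "a photo of <*>". Layer sizes and the placeholder convention are assumptions, not the DeG paper's code.

```python
# Sketch of an image-to-pseudo-word mapping network for ZS-CIR. The token
# produced here would be spliced into the caption's token sequence at the
# placeholder position before text encoding.
import torch
import torch.nn as nn

class Img2PseudoWord(nn.Module):
    def __init__(self, img_dim=512, tok_dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, tok_dim))

    def forward(self, image_emb):
        # (B, img_dim) -> (B, tok_dim): one pseudo-word token per image.
        return self.mlp(image_emb)
```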
arXiv Detail & Related papers (2025-03-07T07:49:31Z)
- SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation [12.32553804641971]
Vision language models (VLMs) have made remarkable progress in natural language processing and image understanding. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M.
arXiv Detail & Related papers (2025-02-12T07:19:36Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models, using labeled triplets of reference image, text, and target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
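For orientation, a minimal sketch of the composed retrieval scoring common to CIR systems: fuse the reference-image embedding with the modification-text embedding and rank candidate targets by cosine similarity. Additive fusion is an assumed simple baseline, not the paper's Visual Delta Generator itself.

```python
# Sketch of composed image retrieval scoring with additive fusion of the
# reference image and modification text embeddings (an assumed baseline).
import torch
import torch.nn.functional as F

def composed_retrieval(ref_emb, text_emb, candidate_embs):
    query = F.normalize(ref_emb + text_emb, dim=-1)   # (B, D) fused query
    candidates = F.normalize(candidate_embs, dim=-1)  # (M, D) gallery
    scores = query @ candidates.t()                   # (B, M) cosine scores
    return scores.argsort(dim=-1, descending=True)    # ranked gallery indices
```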
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- PeaceGAN: A GAN-based Multi-Task Learning Method for SAR Target Image Generation with a Pose Estimator and an Auxiliary Classifier [50.17500790309477]
We propose a novel GAN-based multi-task learning (MTL) method for SAR target image generation, called PeaceGAN.
PeaceGAN uses both pose angle and target class information, which makes it possible to produce SAR target images of desired target classes at intended pose angles.
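A minimal sketch of the conditioning described above: the generator takes noise plus a target-class label and a pose angle, so sampling can be steered to a desired class at an intended aspect angle. The architecture details (embedding sizes, MLP layers, sin/cos angle encoding) are illustrative assumptions, not PeaceGAN's actual network.

```python
# Sketch of a class- and pose-conditioned SAR image generator. Encoding the
# pose angle as (sin, cos) makes 0 and 360 degrees coincide.
import torch
import torch.nn as nn

class CondSARGenerator(nn.Module):
    def __init__(self, z_dim=100, n_classes=10, hidden=256, out_pixels=64 * 64):
        super().__init__()
        self.class_emb = nn.Embedding(n_classes, 32)
        self.net = nn.Sequential(
            nn.Linear(z_dim + 32 + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, out_pixels), nn.Tanh())

    def forward(self, z, class_id, pose_deg):
        # z: (B, z_dim) noise; class_id: (B,) ints; pose_deg: (B,) degrees.
        pose = torch.deg2rad(pose_deg)
        cond = torch.cat([z, self.class_emb(class_id),
                          torch.sin(pose)[:, None],
                          torch.cos(pose)[:, None]], dim=-1)
        return self.net(cond)  # (B, out_pixels) flattened image in [-1, 1]
```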
arXiv Detail & Related papers (2021-03-29T10:03:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.