Related papers: SAR-KnowLIP: Towards Multimodal Foundation Models for Remote Sensing

SAR-KnowLIP: Towards Multimodal Foundation Models for Remote Sensing

URL: http://arxiv.org/abs/2509.23927v1
Date: Sun, 28 Sep 2025 15:03:25 GMT
Title: SAR-KnowLIP: Towards Multimodal Foundation Models for Remote Sensing
Authors: Yi Yang, Xiaokun Zhang, Qingchen Fang, Ziqi Ye, Rui Li, Li Liu, Haipeng Wang,
Abstract summary: Cross-modal artificial intelligence has garnered widespread attention in recent years, achieving significant progress in the study of natural images.<n>Existing methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery.<n>This paper proposes SAR-KnowLIP, the first universal SAR multimodal foundational model, along with reusable data and evaluation baselines.
Score: 13.878173189132085
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cross-modal artificial intelligence has garnered widespread attention in recent years, achieving significant progress in the study of natural images. However, existing methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery. SAR, with its all-day, all-weather imaging capabilities, plays an irreplaceable role in remote sensing scene understanding. To address this gap, this paper proposes SAR-KnowLIP, the first universal SAR multimodal foundational model, along with reusable data and evaluation baselines. Specifically: (1) This work introduces the critical yet long-overlooked attribute of geographic information into remote sensing research, constructing SAR-GEOVL-1M (the first large-scale SAR dataset with complete geographic projection properties), covering multiple satellite platforms, 120,000 images, and 135 cities. (2) Aligned structured text is generated through a hierarchical cognitive chain-of-thought (HCoT), providing more than one million multi-dimensional semantic annotations of landforms, regional functions, target attributes, and spatial relationships. (3) We design a Self-Consistent Iterative Optimization mechanism that continuously enhances cross-modal alignment through a self-supervised closed loop of contrastive, matching, and reconstruction learning on a transferable multimodal encoder. (4) A unified evaluation benchmark is established across 11 representative downstream vision and vision-language tasks, with comparisons against 14 leading foundation models, where SAR-KnowLIP demonstrates leading performance, particularly in object counting and land-cover classification. We expect that SAR-KnowLIP's large-scale multimodal data, transferable model architecture, and comprehensive experimental benchmark will significantly advance the development of SAR multimodal baseline models.

Related papers

StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models.<n>We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information.<n>To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
arXiv Detail & Related papers (2026-03-02T11:35:05Z)
FUSAR-GPT : A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery [8.62554606349568]
FUSAR-GPT is a VLM specifically for Synthetic Aperture Radar (SAR) applications.<n>It embeds multi-source remote-sensing temporal features into the model's visual backbone via'spatiotemporal anchors'<n>It achieves state-of-the-art performance across several typical remote sensing visual-language benchmark tests.
arXiv Detail & Related papers (2026-02-22T13:40:17Z)
SARCLIP: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery [46.87845911116779]
We introduce SARCLIP, the first vision language foundation model tailored for the SAR domain.<n>SARCLIP is trained using a contrastive vision language learning approach by domain transferring strategy.<n>Experiments on image-text retrieval and zero-shot classification tasks demonstrate the superior performance of SARCLIP.
arXiv Detail & Related papers (2025-10-26T13:04:50Z)
Knowledge-guided Complex Diffusion Model for PolSAR Image Classification in Contourlet Domain [58.46450049579116]
We propose a knowledge-guided complex diffusion model for PolSAR image classification in the Contourlet domain.<n> Specifically, the Contourlet transform is first applied to decompose the data into low- and high-frequency subbands.<n>A knowledge-guided complex diffusion network is then designed to model the statistical properties of the low-frequency components.
arXiv Detail & Related papers (2025-07-08T04:50:28Z)
CGEarthEye:A High-Resolution Remote Sensing Vision Foundation Model Based on the Jilin-1 Satellite Constellation [3.5464435279468907]
Jilin-1 is the world's largest sub-meter-level commercial RS satellite constellation.<n>This study proposes CGEarthEye, a RSVFM framework specifically designed for Jilin-1 satellite characteristics.
arXiv Detail & Related papers (2025-07-01T01:05:18Z)
TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation [65.74990259650984]
We introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery.<n>Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism.<n>TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.
arXiv Detail & Related papers (2025-06-06T17:59:50Z)
Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation [67.23953699167274]
Self-supervised learning (SSL) has enabled the development of vision foundation models for Earth Observation (EO)<n>In EO, this challenge is amplified by the redundancy and heavy-tailed distributions common in satellite imagery.<n>We propose a dynamic dataset pruning strategy designed to improve SSL pre-training by maximizing dataset diversity and balance.
arXiv Detail & Related papers (2025-04-09T15:13:26Z)
RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation [24.48561340129571]
RingMoE is a unified RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites.<n>It has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.
arXiv Detail & Related papers (2025-04-04T04:47:54Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery [35.550999964460466]
We present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing dataset with 21.5 million temporal sequences. To our best knowledge, SkySense is the largest Multi-Modal to date, whose modules can be flexibly combined or used individually to accommodate various tasks.
arXiv Detail & Related papers (2023-12-15T09:57:21Z)
Predicting Gradient is Better: Exploring Self-Supervised Learning for SAR ATR with a Joint-Embedding Predictive Architecture [23.375515181854254]
Self-Supervised Learning (SSL) methods can achieve various SAR Automatic Target Recognition (ATR) tasks with pre-training in large-scale unlabeled data. SSL aims to construct supervision signals directly from the data, which minimizes the need for expensive expert annotation. This study investigates an effective SSL method for SAR ATR, which can pave the way for a foundation model in SAR ATR.
arXiv Detail & Related papers (2023-11-26T01:05:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.