Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
- URL: http://arxiv.org/abs/2603.01758v1
- Date: Mon, 02 Mar 2026 11:38:12 GMT
- Title: Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
- Authors: Yuxuan Li, Yuming Chen, Yunheng Li, Ming-Ming Cheng, Xiang Li, Jian Yang
- Abstract summary: Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors. Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. We propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning.
- Score: 59.2578488860426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.
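As a rough illustration of the language-pivot idea in CSIA (not the authors' actual implementation), a CLIP-style symmetric contrastive loss can pull each sensor modality's features toward a shared set of text-concept embeddings. The toy embeddings, dimensions, and temperature below are assumptions for the sketch:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize embeddings to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def language_pivot_loss(modality_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning one sensor modality (e.g. SAR)
    to shared text-concept embeddings, using language as the pivot."""
    v = l2_normalize(modality_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature          # (N, N) modality-text similarity matrix
    labels = np.arange(logits.shape[0])     # the i-th feature matches the i-th concept

    def xent(lg):
        # Numerically stable cross-entropy of the softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average both directions: modality -> text and text -> modality.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))                      # shared linguistic concepts
aligned = text + 0.01 * rng.normal(size=(4, 8))     # features already near their concepts
random_feats = rng.normal(size=(4, 8))              # unaligned features
loss_aligned = language_pivot_loss(aligned, text)
loss_random = language_pivot_loss(random_feats, text)
```

Because every modality is pulled toward the same text anchors, heterogeneous sensors become comparable without ever being contrasted against each other directly, which is the decoupling the abstract describes.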
Related papers
- Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot Human-Object Interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. Existing methods, including two-stage methods, tightly couple interaction recognition (IR) with a specific detector. We propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR.
arXiv Detail & Related papers (2026-02-16T19:01:31Z)
- MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP [21.89022894877594]
We propose a novel framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities with natural language semantics. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation.
arXiv Detail & Related papers (2026-01-13T10:44:37Z)
- Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning [0.9558392439655014]
We explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery.
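The zero-shot baseline mentioned above reduces to a nearest-prompt rule in embedding space. The sketch below uses hypothetical pre-computed vectors as stand-ins for frozen CLIP image features and class-prompt embeddings; no real CLIP model is involved:

```python
import numpy as np

def zero_shot_classify(image_feats, class_prompt_embs):
    """Zero-shot prediction: assign each image to the class whose prompt
    embedding (e.g. 'a satellite photo of a {class}') is most similar."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = class_prompt_embs / np.linalg.norm(class_prompt_embs, axis=1, keepdims=True)
    return (img @ txt.T).argmax(axis=1)   # index of the best-matching class prompt

# Toy frozen features: two well-separated classes around orthogonal directions.
rng = np.random.default_rng(1)
protos = np.eye(2, 8)   # hypothetical class-prompt directions
images = np.vstack([protos[0] + 0.1 * rng.normal(size=8) for _ in range(3)] +
                   [protos[1] + 0.1 * rng.normal(size=8) for _ in range(3)])
preds = zero_shot_classify(images, protos)
```

Prompt learning replaces the hand-crafted prompt text with learnable context vectors, but the decision rule at inference stays this same cosine argmax.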
arXiv Detail & Related papers (2025-10-28T11:39:22Z)
- A Multimodal Depth-Aware Method For Embodied Reference Understanding [56.30142869506262]
Embodied Reference Understanding (ERU) requires identifying a target object in a visual scene based on both language instructions and pointing cues. We propose a novel ERU framework that jointly leverages data augmentation, a depth-map modality, and a depth-aware decision module.
arXiv Detail & Related papers (2025-10-09T14:32:21Z)
- Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection [29.24483392547041]
Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs). We propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI).
arXiv Detail & Related papers (2025-07-09T03:16:39Z)
- AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions [5.67477841586604]
AeroLite is a tag-guided captioning framework for remote sensing images. AeroLite leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset. We propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings.
arXiv Detail & Related papers (2025-04-13T11:29:31Z)
- Large Language Model Guided Progressive Feature Alignment for Multimodal UAV Object Detection [21.16636753446158]
Existing multimodal UAV object detection methods often overlook the impact of semantic gaps between modalities. We propose a Large Language Model (LLM) guided Progressive feature Alignment Network called LPANet. We show that our approach outperforms state-of-the-art multimodal UAV object detectors.
arXiv Detail & Related papers (2025-03-10T05:53:30Z)
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both one-to-one comparison and one-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
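Once text and video share a latent semantic space, retrieval at inference time is just ranking candidates by cosine similarity to the query; that is where the speed advantage of shared-embedding methods comes from. The embeddings below are toy stand-ins, not GLSCL's actual features:

```python
import numpy as np

def rank_videos(query_emb, video_embs):
    """Rank candidate videos by cosine similarity to a text query,
    assuming both were pre-projected into a shared embedding space."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ q                 # cosine similarity of each video to the query
    order = np.argsort(-sims)    # best match first
    return order, sims

# Three hypothetical video embeddings and one text query in a 3-d shared space.
video_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])
order, sims = rank_videos(query, video_embs)
```

Because the similarity is a single matrix-vector product over cached embeddings, no cross-modal interaction network runs per query, which is consistent with the large speedup the abstract reports.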
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer [71.82644727907146]
We propose a novel ComPlementary transformer, ComPtr, for diverse bi-source dense prediction tasks. ComPtr treats different inputs equally and builds an efficient dense interaction model in the form of sequence-to-sequence on top of the transformer.
arXiv Detail & Related papers (2023-07-23T15:17:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.