Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
- URL: http://arxiv.org/abs/2603.01758v1
- Date: Mon, 02 Mar 2026 11:38:12 GMT
- Title: Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
- Authors: Yuxuan Li, Yuming Chen, Yunheng Li, Ming-Ming Cheng, Xiang Li, Jian Yang
- Abstract summary: Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors. Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. We propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning.
- Score: 59.2578488860426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.
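As a rough illustration of the language-pivot idea in CSIA (not the authors' actual implementation), a CLIP-style symmetric contrastive loss can pull each sensor modality's features toward a shared set of text-concept embeddings. The toy embeddings, dimensions, and temperature below are assumptions for the sketch:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize embeddings to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def language_pivot_loss(modality_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning one sensor modality (e.g. SAR)
    to shared text-concept embeddings, using language as the pivot."""
    v = l2_normalize(modality_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature          # (N, N) modality-text similarity matrix
    labels = np.arange(logits.shape[0])     # the i-th feature matches the i-th concept

    def xent(lg):
        # Numerically stable cross-entropy of the softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average both directions: modality -> text and text -> modality.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))                      # shared linguistic concepts
aligned = text + 0.01 * rng.normal(size=(4, 8))     # features already near their concepts
random_feats = rng.normal(size=(4, 8))              # unaligned features
loss_aligned = language_pivot_loss(aligned, text)
loss_random = language_pivot_loss(random_feats, text)
```

Because every modality is pulled toward the same text anchors, heterogeneous sensors become comparable without ever being contrasted against each other directly, which is the decoupling the abstract describes.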
Related papers
- Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot Human-Object Interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. Existing methods, including two-stage methods, tightly couple interaction recognition (IR) with a specific detector. We propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR.
arXiv Detail & Related papers (2026-02-16T19:01:31Z)
- MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP [21.89022894877594]
We propose a novel framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities with natural language semantics. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation.
arXiv Detail & Related papers (2026-01-13T10:44:37Z)
- Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning [0.9558392439655014]
We explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery.
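The zero-shot baseline mentioned above reduces to a nearest-prompt rule in embedding space. The sketch below uses hypothetical pre-computed vectors as stand-ins for frozen CLIP image features and class-prompt embeddings; no real CLIP model is involved:

```python
import numpy as np

def zero_shot_classify(image_feats, class_prompt_embs):
    """Zero-shot prediction: assign each image to the class whose prompt
    embedding (e.g. 'a satellite photo of a {class}') is most similar."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = class_prompt_embs / np.linalg.norm(class_prompt_embs, axis=1, keepdims=True)
    return (img @ txt.T).argmax(axis=1)   # index of the best-matching class prompt

# Toy frozen features: two well-separated classes around orthogonal directions.
rng = np.random.default_rng(1)
protos = np.eye(2, 8)   # hypothetical class-prompt directions
images = np.vstack([protos[0] + 0.1 * rng.normal(size=8) for _ in range(3)] +
                   [protos[1] + 0.1 * rng.normal(size=8) for _ in range(3)])
preds = zero_shot_classify(images, protos)
```

Prompt learning replaces the hand-crafted prompt text with learnable context vectors, but the decision rule at inference stays this same cosine argmax.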
arXiv Detail & Related papers (2025-10-28T11:39:22Z)
- A Multimodal Depth-Aware Method For Embodied Reference Understanding [56.30142869506262]
Embodied Reference Understanding (ERU) requires identifying a target object in a visual scene based on both language instructions and pointing cues. We propose a novel ERU framework that jointly leverages data augmentation, a depth-map modality, and a depth-aware decision module.
arXiv Detail & Related papers (2025-10-09T14:32:21Z)
- Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection [29.24483392547041]
Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs). We propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI).
arXiv Detail & Related papers (2025-07-09T03:16:39Z)
- AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions [5.67477841586604]
AeroLite is a tag-guided captioning framework for remote sensing images. AeroLite leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset. We propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings.
arXiv Detail & Related papers (2025-04-13T11:29:31Z)
- Large Language Model Guided Progressive Feature Alignment for Multimodal UAV Object Detection [21.16636753446158]
Existing multimodal UAV object detection methods often overlook the impact of semantic gaps between modalities. We propose a Large Language Model (LLM) guided Progressive feature Alignment Network called LPANet. We show that our approach outperforms state-of-the-art multimodal UAV object detectors.
arXiv Detail & Related papers (2025-03-10T05:53:30Z)
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both one-to-one comparison and one-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
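Once text and video share a latent semantic space, retrieval at inference time is just ranking candidates by cosine similarity to the query; that is where the speed advantage of shared-embedding methods comes from. The embeddings below are toy stand-ins, not GLSCL's actual features:

```python
import numpy as np

def rank_videos(query_emb, video_embs):
    """Rank candidate videos by cosine similarity to a text query,
    assuming both were pre-projected into a shared embedding space."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ q                 # cosine similarity of each video to the query
    order = np.argsort(-sims)    # best match first
    return order, sims

# Three hypothetical video embeddings and one text query in a 3-d shared space.
video_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])
order, sims = rank_videos(query, video_embs)
```

Because the similarity is a single matrix-vector product over cached embeddings, no cross-modal interaction network runs per query, which is consistent with the large speedup the abstract reports.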
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer [71.82644727907146]
We propose a novel ComPlementary transformer, ComPtr, for diverse bi-source dense prediction tasks. ComPtr treats different inputs equally and builds an efficient dense interaction model in the form of sequence-to-sequence on top of the transformer.
arXiv Detail & Related papers (2023-07-23T15:17:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.