VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing
- URL: http://arxiv.org/abs/2512.11490v1
- Date: Fri, 12 Dec 2025 11:39:35 GMT
- Title: VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing
- Authors: Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, Yonghao Xu, Michael Felsberg
- Abstract summary: VLM2GeoVec is a single-encoder vision-language model trained contrastively to embed interleaved inputs in a unified vector space. It unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing.
- Score: 59.73939718087177
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves $\textbf{26.6\%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5\%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8\%}$ P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
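The training recipe named in the abstract (one encoder, interleaved inputs, a contrastive loss) reduces to a standard symmetric InfoNCE objective. A minimal sketch, assuming L2-normalized embeddings and using random tensors as stand-ins for the single encoder's outputs; all names and shapes are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(q: torch.Tensor, t: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (query, target) embedding pairs."""
    q = F.normalize(q, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = q @ t.T / temperature      # (B, B) cosine-similarity logits
    labels = torch.arange(q.size(0))    # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Stand-ins for the single encoder's outputs. In the paper's setting, both sides
# would come from the same encoder applied to interleaved inputs, e.g. an image
# plus a referring expression plus a bounding-box or coordinate token.
queries = torch.randn(32, 768, requires_grad=True)
targets = torch.randn(32, 768)
loss = symmetric_infonce(queries, targets)
loss.backward()
```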
Related papers
- LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval [0.0]
LLandMark is a modular framework for landmark-aware multimodal video retrieval. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes.
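Read as code, the four-stage collaboration might look like the toy pipeline below; every helper is a hypothetical stub standing in for an agent, not LLandMark's actual interfaces:

```python
def parse_and_plan(query):        # stage 1: query parsing and planning
    return {"query": query, "landmark": "pagoda" if "pagoda" in query else None}

def landmark_agent(plan):         # stage 2: landmark -> descriptive visual prompt
    if plan["landmark"]:
        return [f"a photo of a {plan['landmark']} in a Vietnamese scene"]
    return [plan["query"]]

def multimodal_retrieve(prompts): # stage 3: CLIP-based retrieval (stubbed)
    return [f"video_{i}" for i, _ in enumerate(prompts)]

def rerank_and_answer(candidates, plan):  # stage 4: reranked answer synthesis
    return sorted(candidates)

def llandmark(query: str):
    plan = parse_and_plan(query)
    prompts = landmark_agent(plan)
    return rerank_and_answer(multimodal_retrieve(prompts), plan)

print(llandmark("find the pagoda scene at dusk"))
```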
arXiv Detail & Related papers (2026-03-03T11:36:34Z)
- Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models [58.46663983451155]
PixSearch is an end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During decoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization.
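A hedged sketch of the <search>-token control flow described above; the scripted model and one-line retriever are toy stubs, not PixSearch's interfaces:

```python
SEARCH, EOS = "<search>", "<eos>"

def retrieve(query):                    # stub retriever; real queries may be text,
    return [f"[evidence:{query}]"]      # image crops, or pixel-level masks

class StubLMM:
    def __init__(self, script):         # a scripted token stream stands in for decoding
        self.script = iter(script)
    def next_token(self, context):
        return next(self.script, EOS)
    def build_query(self, context):     # PixSearch would pick a query modality here
        return context[-1]

def generate(model, prompt, max_steps=16):
    ctx = [prompt]
    for _ in range(max_steps):
        tok = model.next_token(ctx)
        if tok == SEARCH:               # emitting <search> pauses decoding and
            ctx += retrieve(model.build_query(ctx))  # splices retrieved evidence in
        elif tok == EOS:
            break
        else:
            ctx.append(tok)
    return ctx

print(generate(StubLMM(["The", "landmark", SEARCH, "is", "famous"]), "Q:"))
```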
arXiv Detail & Related papers (2026-01-27T00:46:08Z)
- A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization [49.13032757301023]
We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task is to retrieve the most relevant geo-referenced image from a large multi-platform corpus. We train three platform experts using a progressive two-stage, hard-negative mining strategy to enhance discriminative power.
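The hard-negative mining ingredient is standard enough to sketch; below is a generic in-batch version with a triplet margin loss, illustrative of the named strategy rather than the authors' exact two-stage recipe:

```python
import torch
import torch.nn.functional as F

def triplet_with_hard_negatives(q, p, margin=0.2):
    """q, p: (B, D) L2-normalized query / positive-image embeddings."""
    sim = q @ p.T                          # (B, B) similarity matrix
    pos = sim.diag()                       # matched (query, image) pairs
    sim = sim - 2.0 * torch.eye(len(q))    # push positives below any cosine value
    hard_neg = sim.max(dim=1).values       # hardest in-batch negative per query
    return F.relu(margin - pos + hard_neg).mean()

q = F.normalize(torch.randn(16, 256, requires_grad=True), dim=-1)
p = F.normalize(torch.randn(16, 256), dim=-1)
loss = triplet_with_hard_negatives(q, p)
loss.backward()
```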
arXiv Detail & Related papers (2025-10-23T07:23:47Z)
- MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. We introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z)
- OSDA: A Framework for Open-Set Discovery and Automatic Interpretation of Land-cover in Remote Sensing Imagery [10.196580289786414]
Open-set land-cover analysis in remote sensing requires both fine-grained spatial localization and semantically open categorization. We introduce OSDA, an integrated three-stage framework for annotation-free open-set land-cover discovery, segmentation, and description. Our work provides a scalable and interpretable solution for dynamic land-cover monitoring, showing strong potential for automated cartographic updating and large-scale earth observation analysis.
arXiv Detail & Related papers (2025-09-23T06:23:56Z)
- Saccadic Vision for Fine-Grained Visual Classification [10.681604440788854]
Fine-grained visual classification (FGVC) requires distinguishing between visually similar categories through subtle, localized features. Existing part-based methods rely on complex localization networks that learn mappings from pixel to sample space. We propose a two-stage process that first extracts peripheral features and generates a sample map. We employ contextualized selective attention to weigh the impact of each fixation patch before fusing peripheral and focus representations.
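A minimal sketch of the peripheral-then-fixation idea on ViT-style patch features; the saliency scoring, the top-k "saccade" sampler, and the fusion are illustrative assumptions, not the paper's exact modules:

```python
import torch
import torch.nn.functional as F

def saccadic_pool(feats: torch.Tensor, k: int = 4) -> torch.Tensor:
    """feats: (B, N, D) patch features from a backbone."""
    peripheral = feats.mean(dim=1)                        # stage 1: coarse scene summary
    scores = (feats * peripheral.unsqueeze(1)).sum(-1)    # sample map over patches
    idx = scores.topk(k, dim=1).indices                   # k fixation locations
    fix = torch.gather(feats, 1, idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
    attn = F.softmax((fix * peripheral.unsqueeze(1)).sum(-1), dim=1)  # selective attention
    focus = (attn.unsqueeze(-1) * fix).sum(dim=1)         # weighted fixation summary
    return torch.cat([peripheral, focus], dim=-1)         # fuse peripheral + focus

pooled = saccadic_pool(torch.randn(2, 196, 128))          # e.g., a 14x14 patch grid
print(pooled.shape)                                       # torch.Size([2, 256])
```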
arXiv Detail & Related papers (2025-09-19T07:03:37Z)
- Robust Cross-View Geo-Localization via Content-Viewpoint Disentanglement [21.192114177279695]
Cross-view geo-localization (CVGL) aims to match images of the same geographic location captured from different perspectives, such as drones and satellites. CVGL remains highly challenging due to significant appearance changes and spatial distortions caused by viewpoint variations. We propose $\textbf{CVD}$, a new CVGL framework that explicitly disentangles $\textit{content}$ and $\textit{viewpoint}$ factors.
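One way to make the content/viewpoint split concrete: partition the embedding, match cross-view pairs on the content part only, and penalize cross-covariance between the parts. A hedged sketch; the split sizes and the independence penalty are assumptions, not CVD's actual losses:

```python
import torch
import torch.nn.functional as F

def disentangled_losses(drone_emb, sat_emb, d_content=192, w=0.1):
    dc, dv = drone_emb[:, :d_content], drone_emb[:, d_content:]
    sc = sat_emb[:, :d_content]
    # match same-location pairs on the viewpoint-invariant content part only
    match = 1.0 - F.cosine_similarity(dc, sc).mean()
    # discourage the content and viewpoint parts from carrying the same information
    dc0, dv0 = dc - dc.mean(0), dv - dv.mean(0)
    indep = (dc0.T @ dv0 / len(dc)).pow(2).mean()   # cross-covariance penalty
    return match + w * indep

drone = torch.randn(8, 256, requires_grad=True)     # drone-view embeddings
sat = torch.randn(8, 256)                           # paired satellite-view embeddings
loss = disentangled_losses(drone, sat)
loss.backward()
```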
arXiv Detail & Related papers (2025-05-17T04:10:32Z)
- GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse Function to capture powerful relationships across modalities for pair-wise similarity learning.
The resulting distance metric combines two structured forms: diagonal and block-diagonal terms.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
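One plausible reading of a metric with diagonal and block-diagonal terms is a learnable quadratic form over the difference vector. A hedged sketch under that assumption, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class StructuredSparseDistance(nn.Module):
    def __init__(self, dim: int, block: int):
        super().__init__()
        assert dim % block == 0
        self.diag = nn.Parameter(torch.ones(dim))                # diagonal term
        self.blocks = nn.Parameter(                              # block-diagonal term
            torch.eye(block).repeat(dim // block, 1, 1))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        z = x - y                                                # (B, dim)
        diag_term = (self.diag * z * z).sum(-1)                  # z^T diag(d) z
        zb = z.view(z.size(0), -1, self.blocks.size(-1))         # (B, n_blocks, block)
        block_term = torch.einsum('nbi,bij,nbj->n', zb, self.blocks, zb)
        return diag_term + block_term

metric = StructuredSparseDistance(dim=512, block=64)
d = metric(torch.randn(4, 512), torch.randn(4, 512))             # (4,) distances
```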
arXiv Detail & Related papers (2024-10-20T03:45:50Z)
- Transferring to Real-World Layouts: A Depth-aware Framework for Scene Adaptation [34.786268652516355]
Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from source synthetic data to real-world target data.
We propose a depth-aware framework to explicitly leverage depth estimation to mix the categories and facilitate the two complementary tasks, i.e., segmentation and depth learning.
In particular, the framework contains a Depth-guided Contextual Filter (DCF) for data augmentation and a cross-task encoder for contextual learning.
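The depth-guided augmentation can be pictured as class-mix pasting with depth deciding occlusion order (nearer source pixels win). A hedged numpy sketch of that idea; the class ids and label conventions are assumptions, not the paper's DCF implementation:

```python
import numpy as np

def depth_guided_mix(src_img, src_lbl, src_dep, tgt_img, tgt_lbl, tgt_dep, classes):
    """Paste source pixels of the sampled classes onto the target image,
    keeping only pixels that are nearer than the target's depth there."""
    mixed, lbl = tgt_img.copy(), tgt_lbl.copy()
    paste = np.isin(src_lbl, classes) & (src_dep < tgt_dep)   # depth-consistent mask
    mixed[paste], lbl[paste] = src_img[paste], src_lbl[paste]
    return mixed, lbl

H = W = 64
mixed, lbl = depth_guided_mix(
    np.random.rand(H, W, 3), np.random.randint(0, 19, (H, W)), np.random.rand(H, W),
    np.random.rand(H, W, 3), np.random.randint(0, 19, (H, W)), np.random.rand(H, W),
    classes=[11, 12, 17],   # e.g., person / rider / motorcycle in Cityscapes ids
)
```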
arXiv Detail & Related papers (2023-11-21T15:39:21Z)
- Hierarchical Matching and Reasoning for Multi-Query Image Retrieval [113.44470784756308]
We propose a novel Hierarchical Matching and Reasoning Network (HMRN) for Multi-Query Image Retrieval (MQIR).
It disentangles MQIR into three hierarchical semantic representations, which are responsible for capturing fine-grained local details, contextual global scopes, and high-level inherent correlations.
Our HMRN substantially surpasses the current state-of-the-art methods.
arXiv Detail & Related papers (2023-06-26T07:03:56Z)
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
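The "common semantic space" amounts to scoring region or mask features against text embeddings of class names, so both tasks share one label space. A minimal sketch with a hash-seeded stub in place of the pre-trained text encoder (names and dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

def stub_text_encoder(names, dim=256):
    """Deterministic stand-in for a pre-trained text encoder."""
    embs = []
    for n in names:
        g = torch.Generator().manual_seed(hash(n) % (2**31))
        embs.append(torch.randn(dim, generator=g))
    return torch.stack(embs)

concepts = ["car", "person", "traffic light"]          # shared across both tasks
text_emb = F.normalize(stub_text_encoder(concepts), dim=-1)

# features for detection boxes or segmentation masks (stubbed)
region_feats = F.normalize(torch.randn(10, 256), dim=-1)
logits = region_feats @ text_emb.T                     # open-vocabulary class scores
print(logits.argmax(dim=-1))                           # predicted concept per region
```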
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
- AutoPose: Searching Multi-Scale Branch Aggregation for Pose Estimation [96.29533512606078]
We present AutoPose, a novel neural architecture search (NAS) framework.
It is capable of automatically discovering multiple parallel branches of cross-scale connections towards accurate and high-resolution 2D human pose estimation.
arXiv Detail & Related papers (2020-08-16T22:27:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.