A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization
- URL: http://arxiv.org/abs/2510.20291v1
- Date: Thu, 23 Oct 2025 07:23:47 GMT
- Title: A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization
- Authors: LinFeng Li, Jian Zhao, Zepeng Yang, Yuhang Song, Bojun Lin, Tianle Zhang, Yuchen Yuan, Chi Zhang, Xuelong Li
- Abstract summary: We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task is to retrieve the most relevant geo-referenced image from a large multi-platform corpus. We train three platform experts using a progressive two-stage, hard-negative mining strategy to enhance discriminative power.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task is to retrieve the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we train three platform experts with a progressive two-stage, hard-negative mining strategy to enhance discriminative power, and fuse their scores at inference. The system tops the official leaderboard, demonstrating robust cross-modal geo-localization under heterogeneous viewpoints.
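The abstract describes scoring each query under three platform experts and fusing the scores at inference. A minimal sketch of that fusion step is below, assuming cosine similarity between embeddings and a simple weighted concatenation of per-expert scores; the function names, the `weights` parameter, and the fusion rule are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def cosine_scores(query_emb, gallery_embs):
    """Cosine similarity between one query vector and a gallery matrix."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return g @ q

def fused_retrieval(query_emb, expert_galleries, weights=None):
    """Score a text-query embedding under each platform expert's image
    gallery (satellite/drone/ground) and merge into a single ranking.

    expert_galleries: dict platform -> (gallery_embs, image_ids)
    Returns image ids sorted by fused (weighted) score, best first.
    """
    weights = weights or {p: 1.0 for p in expert_galleries}
    all_ids, all_scores = [], []
    for platform, (embs, ids) in expert_galleries.items():
        s = weights[platform] * cosine_scores(query_emb, embs)
        all_ids.extend(ids)
        all_scores.extend(s.tolist())
    order = np.argsort(all_scores)[::-1]  # descending similarity
    return [all_ids[i] for i in order]
```

In practice the query embedding would come from BGE-M3 and the gallery embeddings from EVA-CLIP (after projection into a shared space); this sketch only shows the score-level fusion across experts.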
Related papers
- Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation [13.743073097114461]
Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing. We propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework to guide open-vocabulary segmentation models toward precise mapping.
arXiv Detail & Related papers (2026-02-09T02:09:21Z)
- VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing [59.73939718087177]
A single-encoder vision-language model trained contrastively to embed interleaved inputs in a unified vector space. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing.
arXiv Detail & Related papers (2025-12-12T11:39:35Z)
- Referring Video Object Segmentation with Cross-Modality Proxy Queries [23.504655272754587]
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism. We propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics.
arXiv Detail & Related papers (2025-11-26T07:45:41Z)
- Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning [81.43257201833154]
We propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets.
arXiv Detail & Related papers (2025-10-20T16:01:11Z)
- A Multimodal Depth-Aware Method For Embodied Reference Understanding [56.30142869506262]
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. We propose a novel ERU framework that jointly leverages data augmentation, depth-map modality, and a depth-aware decision module.
arXiv Detail & Related papers (2025-10-09T14:32:21Z)
- TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation [8.48847068018671]
This paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network. It enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the KPS, we design the Multiscale Linear Cross-Attention Module (MLAM), which establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. The KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies.
arXiv Detail & Related papers (2025-09-16T13:26:58Z)
- GLEAM: Learning to Match and Explain in Cross-View Geo-Localization [66.11208984986813]
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. We present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities (including UAV imagery, street maps, panoramic views, and ground photographs) by aligning them exclusively with satellite imagery. To address the lack of interpretability in traditional CVGL methods, we propose GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning.
arXiv Detail & Related papers (2025-09-09T07:14:31Z)
- Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges. Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
arXiv Detail & Related papers (2025-01-01T14:24:04Z)
- SDPL: Shifting-Dense Partition Learning for UAV-View Geo-Localization [27.131867916908156]
Cross-view geo-localization aims to match images of the same target from different platforms.
We introduce a part-based representation learning method, shifting-dense partition learning (SDPL).
We show that SDPL is robust to position shifting, and performs competitively on two prevailing benchmarks.
arXiv Detail & Related papers (2024-03-07T03:07:54Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.