A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization
- URL: http://arxiv.org/abs/2510.20291v1
- Date: Thu, 23 Oct 2025 07:23:47 GMT
- Title: A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization
- Authors: LinFeng Li, Jian Zhao, Zepeng Yang, Yuhang Song, Bojun Lin, Tianle Zhang, Yuchen Yuan, Chi Zhang, Xuelong Li
- Abstract summary: We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task is to retrieve the most relevant geo-referenced image from a large multi-platform corpus. We train three platform experts using a progressive two-stage, hard-negative mining strategy to enhance discriminative power.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task is to retrieve the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we train three platform experts with a progressive two-stage, hard-negative mining strategy to enhance discriminative power, and fuse their scores at inference. The system tops the official leaderboard, demonstrating robust cross-modal geo-localization under heterogeneous viewpoints.
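The abstract describes scoring each query under three platform experts and fusing the scores at inference. A minimal sketch of that fusion step is below, assuming cosine similarity between embeddings and a simple weighted concatenation of per-expert scores; the function names, the `weights` parameter, and the fusion rule are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def cosine_scores(query_emb, gallery_embs):
    """Cosine similarity between one query vector and a gallery matrix."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return g @ q

def fused_retrieval(query_emb, expert_galleries, weights=None):
    """Score a text-query embedding under each platform expert's image
    gallery (satellite/drone/ground) and merge into a single ranking.

    expert_galleries: dict platform -> (gallery_embs, image_ids)
    Returns image ids sorted by fused (weighted) score, best first.
    """
    weights = weights or {p: 1.0 for p in expert_galleries}
    all_ids, all_scores = [], []
    for platform, (embs, ids) in expert_galleries.items():
        s = weights[platform] * cosine_scores(query_emb, embs)
        all_ids.extend(ids)
        all_scores.extend(s.tolist())
    order = np.argsort(all_scores)[::-1]  # descending similarity
    return [all_ids[i] for i in order]
```

In practice the query embedding would come from BGE-M3 and the gallery embeddings from EVA-CLIP (after projection into a shared space); this sketch only shows the score-level fusion across experts.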
Related papers
- Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation [13.743073097114461]
Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing. We propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework to guide open-vocabulary segmentation models toward precise mapping.
arXiv Detail & Related papers (2026-02-09T02:09:21Z)
- VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing [59.73939718087177]
A single-encoder vision-language model trained contrastively to embed interleaved inputs in a unified vector space. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing.
arXiv Detail & Related papers (2025-12-12T11:39:35Z)
- Referring Video Object Segmentation with Cross-Modality Proxy Queries [23.504655272754587]
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism. We propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics.
arXiv Detail & Related papers (2025-11-26T07:45:41Z)
- Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning [81.43257201833154]
We propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets.
arXiv Detail & Related papers (2025-10-20T16:01:11Z)
- A Multimodal Depth-Aware Method For Embodied Reference Understanding [56.30142869506262]
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. We propose a novel ERU framework that jointly leverages data augmentation, depth-map modality, and a depth-aware decision module.
arXiv Detail & Related papers (2025-10-09T14:32:21Z)
- TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation [8.48847068018671]
This paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network. It enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the KPS, we design the Multiscale Linear Cross-Attention Module (MLAM), which establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. The KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies.
arXiv Detail & Related papers (2025-09-16T13:26:58Z)
- GLEAM: Learning to Match and Explain in Cross-View Geo-Localization [66.11208984986813]
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. We present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities (including UAV imagery, street maps, panoramic views, and ground photographs) by aligning them exclusively with satellite imagery. To address the lack of interpretability in traditional CVGL methods, we propose GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning.
arXiv Detail & Related papers (2025-09-09T07:14:31Z)
- Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges. Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
arXiv Detail & Related papers (2025-01-01T14:24:04Z)
- SDPL: Shifting-Dense Partition Learning for UAV-View Geo-Localization [27.131867916908156]
Cross-view geo-localization aims to match images of the same target from different platforms.
We introduce a part-based representation learning method, shifting-dense partition learning (SDPL).
We show that SDPL is robust to position shifting, and performs competitively on two prevailing benchmarks.
arXiv Detail & Related papers (2024-03-07T03:07:54Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.