MobileGeo: Exploring Hierarchical Knowledge Distillation for Resource-Efficient Cross-view Drone Geo-Localization
- URL: http://arxiv.org/abs/2510.22582v2
- Date: Wed, 05 Nov 2025 02:55:54 GMT
- Title: MobileGeo: Exploring Hierarchical Knowledge Distillation for Resource-Efficient Cross-view Drone Geo-Localization
- Authors: Jian Sun, Kangdao Liu, Chi Zhang, Chuangquan Chen, Junge Shen, Chi-Man Vong
- Abstract summary: Cross-view geo-localization enables drone localization by matching aerial images to geo-tagged satellite databases. MobileGeo is a mobile-friendly framework designed for efficient on-device CVGL. MobileGeo runs at 251.5 FPS on an NVIDIA AGX Orin edge device, demonstrating its practical viability for real-time on-device drone geo-localization.
- Score: 47.16612614191333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-view geo-localization (CVGL) enables drone localization by matching aerial images to geo-tagged satellite databases, which is critical for autonomous navigation in GNSS-denied environments. However, existing methods rely on resource-intensive feature alignment and multi-branch architectures, incurring high inference costs that limit their deployment on mobile edge devices. We propose MobileGeo, a mobile-friendly framework designed for efficient on-device CVGL. MobileGeo achieves its efficiency through two key components: 1) During training, a Hierarchical Distillation (HD-CVGL) paradigm, coupled with Uncertainty-Aware Prediction Alignment (UAPA), distills essential information into a compact model without incurring inference overhead. 2) During inference, an efficient Multi-view Selection Refinement Module (MSRM) leverages mutual information to filter redundant views and reduce computational load. Extensive experiments demonstrate that MobileGeo outperforms previous state-of-the-art methods, achieving a 4.19% improvement in AP on the University-1652 dataset while being over 5× more efficient in FLOPs and 3× faster. Crucially, MobileGeo runs at 251.5 FPS on an NVIDIA AGX Orin edge device, demonstrating its practical viability for real-time on-device drone geo-localization.
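The abstract frames CVGL as retrieval: a drone-view image is matched against a geo-tagged satellite gallery, and quality is reported as AP. The following is a minimal sketch of that retrieval step and of standard non-interpolated average precision, not MobileGeo's implementation. The embeddings are made-up toy vectors, the function names (`cosine`, `rank_gallery`, `average_precision`) are hypothetical, and benchmark evaluation scripts may compute AP with a slightly different interpolation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    sims = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)

def average_precision(ranking, relevant):
    """Non-interpolated AP for one query: mean precision@k over ranks that hit."""
    hits, total = 0, 0.0
    for k, idx in enumerate(ranking, start=1):
        if idx in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

# Toy data (assumed, not from the paper): one drone-view embedding and
# four satellite-tile embeddings; tile 1 is the true match.
drone_query = [1.0, 0.0]
satellite_gallery = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.7, 0.7]]

ranking = rank_gallery(drone_query, satellite_gallery)
ap = average_precision(ranking, relevant={1})
print(ranking, ap)
```

In a real system the embeddings would come from the trained image encoder, and the gallery side can be precomputed offline so that only the query embedding and the similarity search run on the device.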
Related papers
- MCOP: Multi-UAV Collaborative Occupancy Prediction [40.58729551462363]
Current Bird's Eye View (BEV)-based approaches exhibit two main limitations. We propose a novel multi-UAV collaborative occupancy prediction framework. Our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods.
arXiv Detail & Related papers (2025-10-14T16:17:42Z)
- InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment [3.6927415209865533]
InstaGeo is an open-source framework for transforming raw satellite imagery into model-ready datasets. We show how InstaGeo can transform raw imagery into model-ready datasets and derive compact, compute-efficient models. We also show how InstaGeo can transform research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation.
arXiv Detail & Related papers (2025-10-07T06:57:15Z)
- SWA-PF: Semantic-Weighted Adaptive Particle Filter for Memory-Efficient 4-DoF UAV Localization in GNSS-Denied Environments [8.46731803518948]
Vision-based Unmanned Aerial Vehicle (UAV) localization systems have been extensively investigated for Global Navigation Satellite System (GNSS)-denied environments. We present a large-scale Multi-Altitude Flight Segments dataset (MAFS) for variable altitude scenarios. We propose a novel Semantic-Weighted Adaptive Particle Filter (SWA-PF) method to overcome these limitations.
arXiv Detail & Related papers (2025-09-17T08:05:36Z)
- Light-Weight Cross-Modal Enhancement Method with Benchmark Construction for UAV-based Open-Vocabulary Object Detection [6.443926939309045]
We propose a complete UAV-oriented solution that combines both dataset construction and model innovation. First, we design a refined UAV-Label Engine, which efficiently resolves annotation redundancy, inconsistency, and ambiguity. Second, we introduce the Cross-Attention Gated Enhancement (CAGE) module, a lightweight dual-path fusion design that integrates cross-attention, adaptive gating, and global FiLM modulation for robust text-vision alignment.
arXiv Detail & Related papers (2025-09-07T10:59:02Z)
- GeoLocSFT: Efficient Visual Geolocation via Supervised Fine-Tuning of Multimodal Foundation Models [4.956977275061966]
GeoLocSFT is trained with only 2700 carefully selected image-GPS pairs from our geographically diverse MR600k dataset. Despite this limited data, our SFT-centric approach substantially improves over baseline models. Our findings highlight the power of high-quality supervision and efficient SFT for planet-scale image geolocation.
arXiv Detail & Related papers (2025-06-02T03:16:19Z)
- Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images [0.9883261192383611]
In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in unstructured environments. We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly.
arXiv Detail & Related papers (2025-03-23T08:25:07Z)
- BEVDiffLoc: End-to-End LiDAR Global Localization in BEV View based on Diffusion Model [8.720833232645155]
Bird's-Eye-View (BEV) image is one of the most widely adopted data representations in autonomous driving. We propose BEVDiffLoc, a novel framework that formulates LiDAR localization as a conditional generation of poses.
arXiv Detail & Related papers (2025-03-14T13:17:43Z)
- STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision [3.671692919685993]
We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective observations into global map perspective representations. We evaluate these models across two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan.
arXiv Detail & Related papers (2025-03-11T00:38:54Z)
- Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric. We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z)
- FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation [52.89847760590189]
3D scene understanding is a critical yet challenging task in autonomous driving. Recent methods leverage the range-view representation to improve processing efficiency. We re-design the workflow for range-view-based LiDAR semantic segmentation.
arXiv Detail & Related papers (2025-02-13T12:39:26Z)
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
- Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z)
- GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities.
GOMAA-Geo is a goal modality agnostic active geo-localization agent for zero-shot generalization between different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z)
- Deep Homography Estimation for Visual Place Recognition [49.235432979736395]
We propose a transformer-based deep homography estimation (DHE) network.
It takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification.
Experiments on benchmark datasets show that our method can outperform several state-of-the-art methods.
arXiv Detail & Related papers (2024-02-25T13:22:17Z)
- GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization [61.10806364001535]
Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth.
Existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task.
We propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations.
arXiv Detail & Related papers (2023-09-27T20:54:56Z)
- A Gis Aided Approach for Geolocalizing an Unmanned Aerial System Using Deep Learning [0.4297070083645048]
We propose an alternative approach to geolocalize a UAS when GPS signal is degraded or denied.
Considering UAS has a downward-looking camera on its platform that can acquire real-time images as the platform flies, we apply modern deep learning techniques to achieve geolocalization.
We extract GIS information from OpenStreetMap (OSM) to semantically segment matched features into building and terrain classes.
arXiv Detail & Related papers (2022-08-25T17:51:15Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scaled pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)