Vision Foundation Models for Domain Generalisable Cross-View Localisation in Planetary Ground-Aerial Robotic Teams
- URL: http://arxiv.org/abs/2601.09107v1
- Date: Wed, 14 Jan 2026 03:11:05 GMT
- Title: Vision Foundation Models for Domain Generalisable Cross-View Localisation in Planetary Ground-Aerial Robotic Teams
- Authors: Lachlan Holden, Feras Dayoub, Alberto Candela, David Harvey, Tat-Jun Chin
- Abstract summary: We consider rovers using machine learning to localise themselves in a local aerial map using limited field-of-view monocular ground-view RGB images as input. A key consideration for machine learning methods is that real space data with ground-truth position labels suitable for training is scarce. We propose a novel method of localising rovers in an aerial map using cross-view-localising dual-encoder deep neural networks.
- Score: 15.147723721875456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate localisation in planetary robotics enables the advanced autonomy required to support the increased scale and scope of future missions. The successes of the Ingenuity helicopter and multiple planetary orbiters lay the groundwork for future missions that use ground-aerial robotic teams. In this paper, we consider rovers using machine learning to localise themselves in a local aerial map using limited field-of-view monocular ground-view RGB images as input. A key consideration for machine learning methods is that real space data with ground-truth position labels suitable for training is scarce. In this work, we propose a novel method of localising rovers in an aerial map using cross-view-localising dual-encoder deep neural networks. We leverage semantic segmentation with vision foundation models and high-volume synthetic data to bridge the domain gap to real images. We also contribute a new cross-view dataset of real-world rover trajectories with corresponding ground-truth localisation data captured in a planetary analogue facility, plus a high-volume dataset of analogous synthetic image pairs. Using particle filters for state estimation with the cross-view networks allows accurate position estimation over simple and complex trajectories based on sequences of ground-view images.
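To make the approach concrete, below is a minimal sketch of the two ingredients the abstract describes: a dual-encoder network that embeds ground-view images and aerial map patches into a shared space, and a particle filter that uses the cross-view similarity as its measurement likelihood. Everything here (the stand-in CNN architecture, dimensions, noise and temperature values) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch, not the authors' code: dual encoders for ground and aerial
# views plus a particle filter whose measurement model is the cross-view
# embedding similarity. All architectural choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(out_dim=128):
    # Stand-in CNN; the paper feeds vision-foundation-model segmentations,
    # which we approximate here with raw 3-channel input.
    return nn.Sequential(
        nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim),
    )

class CrossViewDualEncoder(nn.Module):
    # One encoder per view; embeddings are compared by cosine similarity.
    def __init__(self, dim=128):
        super().__init__()
        self.ground_enc = make_encoder(dim)
        self.aerial_enc = make_encoder(dim)

    def similarity(self, ground, aerial_patches):
        g = F.normalize(self.ground_enc(ground), dim=-1)          # (1, D)
        a = F.normalize(self.aerial_enc(aerial_patches), dim=-1)  # (N, D)
        return (a @ g.T).squeeze(-1)                              # (N,)

def particle_filter_step(particles, weights, odometry, similarity_at):
    # particles: (N, 2) xy map positions; similarity_at maps positions to
    # cross-view similarity scores for the current ground-view image.
    particles = particles + odometry + 0.5 * torch.randn_like(particles)
    weights = weights * torch.softmax(similarity_at(particles) / 0.1, dim=0)
    weights = weights / weights.sum()
    # Multinomial resampling for brevity; systematic resampling is typical.
    idx = torch.multinomial(weights, len(weights), replacement=True)
    return particles[idx], torch.full_like(weights, 1.0 / len(weights))
```

After each step, a position estimate can be read out as the particle mean, e.g. `particles.mean(0)`; the similarity function would crop the aerial map around each particle and score it with the aerial encoder.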
Related papers
- High-fidelity 3D reconstruction for planetary exploration [0.15749416770494704]
This work explores the integration of radiance field-based methods into a unified environment reconstruction pipeline for planetary robotics. Our system combines the Nerfstudio and COLMAP frameworks with a ROS2-compatible workflow capable of processing raw rover data directly from rosbag recordings. The resulting pipeline established a foundation for future research in radiance field-based mapping, bridging the gap between geometric and neural representations in planetary exploration.
arXiv Detail & Related papers (2026-02-14T22:07:03Z)
- AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis [57.249817395828174]
We propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes with real, ground-level crowd-sourced images. The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks.
arXiv Detail & Related papers (2025-04-17T17:57:05Z)
- AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations [51.44608822712786]
Visual grounding aims to localize target objects in an image based on natural language descriptions. AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects. We introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects.
arXiv Detail & Related papers (2025-04-10T15:13:00Z)
- Learning autonomous driving from aerial imagery [67.06858775696453]
Photogrammetric simulators allow the synthesis of novel views through the transformation of pre-generated assets.
We use a Neural Radiance Field (NeRF) as an intermediate representation to synthesize novel views from the point of view of a ground vehicle.
arXiv Detail & Related papers (2024-10-18T05:09:07Z)
- Energy-Based Models for Cross-Modal Localization using Convolutional Transformers [52.27061799824835]
We present a novel framework for localizing a ground vehicle mounted with a range sensor against satellite imagery in the absence of GPS.
We propose a method using convolutional transformers that performs accurate metric-level localization in a cross-modal manner.
We train our model end-to-end and demonstrate our approach achieving higher accuracy than the state-of-the-art on KITTI, Pandaset, and a custom dataset.
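As a rough illustration of the idea in this entry (not the paper's convolutional-transformer architecture), the sketch below defines an energy over candidate 2D offsets by cross-correlating a range-sensor bird's-eye-view feature template against a satellite-image feature map; the estimated offset minimizes the energy. All shapes and names are assumptions.

```python
# Illustrative sketch only: energy-based cross-modal localization as dense
# template matching between assumed feature maps of the two modalities.
import torch
import torch.nn.functional as F

def localization_energy(bev_feat, sat_feat):
    # bev_feat: (1, C, h, w) template from the vehicle's range sensor.
    # sat_feat: (1, C, H, W) satellite-tile features, with H >= h, W >= w.
    score = F.conv2d(sat_feat, bev_feat)  # cross-correlation, (1, 1, H-h+1, W-w+1)
    return -score.squeeze(0).squeeze(0)   # low energy = good alignment

energy = localization_energy(torch.randn(1, 8, 16, 16), torch.randn(1, 8, 64, 64))
offset = torch.nonzero(energy == energy.min())[0]  # argmin over candidate offsets
```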
arXiv Detail & Related papers (2023-06-06T21:27:08Z)
- LARD - Landing Approach Runway Detection -- Dataset for Vision Based Landing [2.7400353551392853]
We present a dataset of high-quality aerial images for the task of runway detection during approach and landing phases.
Most of the dataset is composed of synthetic images, but we also provide manually labelled images from real landing footage.
This dataset paves the way for further research such as the analysis of dataset quality or the development of models to cope with the detection tasks.
arXiv Detail & Related papers (2023-04-05T08:25:55Z)
- Autonomous Marker-less Rapid Aerial Grasping [5.892028494793913]
We propose a vision-based system for autonomous rapid aerial grasping.
We generate a dense point cloud of the detected objects and perform geometry-based grasp planning.
We show the first use of geometry-based grasping techniques with a flying platform.
arXiv Detail & Related papers (2022-11-23T16:25:49Z)
- Uncertainty-aware Vision-based Metric Cross-view Geolocalization [25.87104194833264]
We present an end-to-end differentiable model that uses the ground and aerial images to predict a probability distribution over possible vehicle poses.
We improve the previous state-of-the-art by a large margin even without ground or aerial data from the test region.
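A minimal sketch of the general idea, assuming per-cell aerial embeddings (this is not the paper's model): matching scores over a grid of candidate positions are normalized into a probability distribution, whose entropy can serve as a crude uncertainty proxy.

```python
# Assumed sketch: matching scores -> probability map over candidate positions.
import torch
import torch.nn.functional as F

def pose_distribution(ground_emb, aerial_feat_map, temperature=0.05):
    # ground_emb: (D,); aerial_feat_map: (D, H, W), one embedding per map cell.
    D, H, W = aerial_feat_map.shape
    a = F.normalize(aerial_feat_map.reshape(D, -1), dim=0)  # (D, H*W)
    g = F.normalize(ground_emb, dim=0)                      # (D,)
    logits = (g @ a) / temperature                          # (H*W,)
    return torch.softmax(logits, dim=0).reshape(H, W)       # sums to 1 over cells

p = pose_distribution(torch.randn(16), torch.randn(16, 32, 32))
uncertainty = -(p * p.clamp_min(1e-9).log()).sum()  # entropy as a confidence proxy
```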
arXiv Detail & Related papers (2022-11-22T10:23:20Z)
- Neural Scene Representation for Locomotion on Structured Terrain [56.48607865960868]
We propose a learning-based method to reconstruct the local terrain for a mobile robot traversing urban environments.
Using a stream of depth measurements from the onboard cameras and the robot's trajectory, the method estimates the topography in the robot's vicinity.
We propose a 3D reconstruction model that faithfully reconstructs the scene, despite the noisy measurements and large amounts of missing data coming from the blind spots of the camera arrangement.
arXiv Detail & Related papers (2022-06-16T10:45:17Z)
- Embedding Earth: Self-supervised contrastive pre-training for dense land cover classification [61.44538721707377]
We present Embedding Earth, a self-supervised contrastive pre-training method that leverages the wide availability of satellite imagery.
We observe significant improvements of up to 25% absolute mIoU when pre-training with our proposed method.
We find that the learnt features can generalize between disparate regions, opening up the possibility of using the proposed pre-training scheme more broadly.
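For context, self-supervised contrastive pre-training of this kind is typically built around an InfoNCE-style objective; the sketch below is an assumed, generic form, not Embedding Earth's actual loss.

```python
# Assumed generic InfoNCE contrastive loss, not this paper's exact objective.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # z1, z2: (B, D) embeddings of two views of the same locations; matching
    # rows are positives, all other rows in the batch are negatives.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```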
arXiv Detail & Related papers (2022-03-11T16:14:14Z)
- Solving Occlusion in Terrain Mapping with Neural Networks [7.703348666813963]
We introduce a self-supervised learning approach capable of training on real-world data without a need for ground-truth information.
Our neural network is able to run in real-time on both CPU and GPU with suitable sampling rates for autonomous ground robots.
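One common self-supervision recipe for this setting, shown below as an assumed sketch rather than the paper's exact scheme, is to artificially hide some observed cells of the elevation map and supervise the network's reconstruction only on those hidden-but-measured cells, so no ground-truth terrain is needed.

```python
# Assumed sketch: self-supervised occlusion in-filling on elevation maps.
import torch

def self_supervised_loss(model, elevation, observed_mask, hide_frac=0.25):
    # elevation: (B, 1, H, W) heightmap; observed_mask: same shape, 1 = measured.
    hide = (torch.rand_like(elevation) < hide_frac) & observed_mask.bool()
    visible = (observed_mask.bool() & ~hide).float()
    net_input = torch.cat([elevation * visible, visible], dim=1)  # (B, 2, H, W)
    pred = model(net_input)  # hypothetical in-filling network, (B, 1, H, W) out
    err = (pred - elevation) ** 2
    # Supervise only where we had a measurement but hid it from the network.
    return (err * hide.float()).sum() / hide.float().sum().clamp_min(1.0)
```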
arXiv Detail & Related papers (2021-09-15T08:30:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.