STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision
- URL: http://arxiv.org/abs/2503.07939v1
- Date: Tue, 11 Mar 2025 00:38:54 GMT
- Title: STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision
- Authors: Hin Wai Lui, Jeffrey L. Krichmar
- Abstract summary: We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective observations into global map perspective representations. We evaluate these models across two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan.
- Score: 3.671692919685993
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper explores vision-based localization through a biologically-inspired approach that mirrors how humans and animals link views or perspectives when navigating their world. We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective (FPP) observations into global map perspective (GMP) representations and precise geographical coordinates. Unlike retrieval-based methods, our approach frames localization as a generative task, learning direct mappings between perspectives without relying on dense satellite image databases. We evaluate these models across two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan. The VAE-Transformer achieves impressive precision, with median deviations of 2.29m (1.37% of environment size) and 4.45m (0.35% of environment size) respectively, outperforming both VAE-RNN and prior cross-view geo-localization approaches. Our comprehensive Localization Performance Characteristics (LPC) analysis demonstrates superior performance with the VAE-Transformer achieving an AUC of 0.777 compared to 0.295 for VIGOR 200 and 0.225 for TransGeo, establishing a new state-of-the-art in vision-based localization. In some scenarios, our vision-based system rivals commercial smartphone GPS accuracy (AUC of 0.797) while requiring 5x less GPU memory and delivering 3x faster inference than existing methods in cross-view geo-localization. These results demonstrate that models inspired by biological spatial navigation can effectively memorize complex, dynamic environments and provide precise localization with minimal computational resources.
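The abstract does not define the LPC curve in detail; a plausible reading, consistent with how such threshold-recall curves are usually constructed, is the fraction of test frames localized within a given error radius, with the AUC taken over a normalized threshold range. A minimal numpy sketch under that assumption (the 50 m cap and the synthetic errors are illustrative, not values from the paper):

```python
import numpy as np

def lpc_auc(errors_m, max_radius_m=50.0, num_thresholds=200):
    """Assumed LPC: recall (fraction of frames localized) as a function
    of an error-radius threshold, with AUC normalized to [0, 1].
    This is a guess at the paper's metric, not its exact definition."""
    errors = np.asarray(errors_m, dtype=float)
    thresholds = np.linspace(0.0, max_radius_m, num_thresholds)
    # Fraction of frames with error within each radius threshold.
    recall = np.array([(errors <= t).mean() for t in thresholds])
    # Trapezoidal AUC, normalized by the threshold range.
    auc = np.trapz(recall, thresholds) / max_radius_m
    return thresholds, recall, auc

# Synthetic errors with a ~2.29 m median, loosely echoing the campus result.
rng = np.random.default_rng(0)
errors = rng.lognormal(mean=np.log(2.29), sigma=0.6, size=1000)
_, _, auc = lpc_auc(errors)
print(f"median error: {np.median(errors):.2f} m  LPC-AUC: {auc:.3f}")
```

Normalizing the AUC by the threshold range keeps scores in [0, 1], which is what makes figures such as 0.777 (VAE-Transformer) and 0.797 (smartphone GPS) comparable across environments of different sizes.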
Related papers
- EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation [50.433911327489554]
We introduce EarthMapper, a novel framework for controllable satellite-map translation.
We also contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities.
Experiments on CNSatMap and the New York dataset demonstrate EarthMapper's superior performance.
arXiv Detail & Related papers (2025-04-28T02:41:12Z)
- RING#: PR-by-PE Global Localization with Roto-translation Equivariant Gram Learning [20.688641105430467]
Global localization is crucial in autonomous driving and robotics applications when GPS signals are unreliable.
Most approaches achieve global localization by sequential place recognition (PR) and pose estimation (PE).
We introduce a new paradigm, PR-by-PE localization, which bypasses the need for separate place recognition by deriving it directly from pose estimation.
We propose RING#, an end-to-end PR-by-PE localization network that operates in bird's-eye-view (BEV) space and is compatible with both vision and LiDAR sensors; a toy sketch of the PR-by-PE idea follows below.
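RING#'s network itself is learned and roto-translation equivariant; purely to make the PR-by-PE control flow concrete, the following hypothetical numpy sketch scores every stored place by the quality of a translation estimate obtained from BEV cross-correlation, so place recognition falls out of pose estimation instead of preceding it:

```python
import numpy as np

def xcorr2(a, b):
    """Circular cross-correlation of two equal-size 2-D grids via FFT."""
    F = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    return np.fft.ifft2(F).real

def pr_by_pe(query_bev, map_bevs, cell_size_m=0.5):
    """Toy PR-by-PE: for each stored place, estimate the 2-D translation
    aligning its BEV grid to the query, and reuse the correlation peak as
    the place-recognition score. Hypothetical illustration only; RING#
    itself is a learned, rotation-equivariant network."""
    best = None
    for place_id, m in enumerate(map_bevs):
        c = xcorr2(query_bev, m)
        score = c.max()
        dy, dx = np.unravel_index(c.argmax(), c.shape)
        # Wrap circular shifts into signed offsets.
        h, w = c.shape
        dy = dy - h if dy > h // 2 else dy
        dx = dx - w if dx > w // 2 else dx
        if best is None or score > best[1]:
            best = (place_id, score, (dx * cell_size_m, dy * cell_size_m))
    return best  # (place id, score, estimated translation in meters)

# Example: random occupancy grids, with a shifted copy of map 3 as the query.
rng = np.random.default_rng(1)
maps = [rng.random((64, 64)) for _ in range(5)]
query = np.roll(maps[3], shift=(4, -6), axis=(0, 1))
print(pr_by_pe(query, maps))  # recovers place 3 and the (x, y) offset
```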
arXiv Detail & Related papers (2024-08-30T18:42:53Z)
- BEV-CV: Birds-Eye-View Transform for Cross-View Geo-Localisation [15.324623975476348]
Cross-view image matching for geo-localisation is a challenging problem due to the significant visual difference between aerial and ground-level viewpoints.
We propose BEV-CV, an approach introducing two key novelties focused on improving the real-world viability of cross-view geo-localisation; a generic BEV-projection sketch follows below.
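The summary does not describe BEV-CV's transform itself; the classical baseline for ground-to-BEV projection is inverse perspective mapping (IPM), which assumes a flat ground plane and known camera geometry. A hedged numpy sketch of plain IPM (the intrinsics, camera height, and pitch below are made-up example values, and nearest-neighbor sampling stands in for a proper warp):

```python
import numpy as np

def ipm_bev(image, fx, fy, cx, cy, cam_height_m, pitch_rad,
            bev_size=(200, 200), cell_m=0.1):
    """Project a front-facing camera image onto a flat ground plane
    (inverse perspective mapping). Toy sketch: no lens distortion,
    nearest-neighbor sampling, illustrative parameters only."""
    H, W = bev_size
    bev = np.zeros((H, W) + image.shape[2:], dtype=image.dtype)
    cp, sp = np.cos(pitch_rad), np.sin(pitch_rad)
    for i in range(H):        # rows index forward distance Z
        for j in range(W):    # columns index lateral offset X
            Z = (H - i) * cell_m           # farther ground at the top
            X = (j - W / 2) * cell_m
            # Ground point in camera coordinates (x right, y down, z
            # forward), camera pitched down by pitch_rad.
            xc = X
            yc = cam_height_m * cp - Z * sp
            zc = Z * cp + cam_height_m * sp
            if zc <= 0.1:
                continue
            u = int(round(fx * xc / zc + cx))
            v = int(round(fy * yc / zc + cy))
            if 0 <= u < image.shape[1] and 0 <= v < image.shape[0]:
                bev[i, j] = image[v, u]
    return bev

img = np.zeros((480, 640, 3), dtype=np.uint8)
bev = ipm_bev(img, fx=500.0, fy=500.0, cx=320.0, cy=240.0,
              cam_height_m=1.5, pitch_rad=np.deg2rad(10))
print(bev.shape)  # (200, 200, 3)
```

In practice the per-cell mapping is computed once and applied as a single homography or sampling-grid warp rather than in a Python loop.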
arXiv Detail & Related papers (2023-12-23T22:20:45Z)
- Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer [66.82008165644892]
We propose a method to increase the accuracy of a ground camera's location and orientation by estimating the relative rotation and translation between the ground-level image and its matched/retrieved satellite image.
Experimental results demonstrate that our method significantly outperforms the state-of-the-art.
arXiv Detail & Related papers (2023-07-16T11:52:27Z)
- Lightweight Vision Transformer with Bidirectional Interaction [59.39874544410419]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information. Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family; a toy local-global fusion sketch follows below.
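The summary gives no architectural details, so the following PyTorch module is only a generic illustration of fusing a local branch (depthwise convolution) with a cheap global branch (attention over pooled tokens) through a learned per-pixel gate; it is not the paper's actual FASA design:

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Toy fusion of a local (depthwise conv) branch and a global
    (downsampled self-attention) branch, mixed by an adaptive gate.
    Illustrative only; not the paper's exact FASA mechanism."""
    def __init__(self, dim, pool=8, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.pool = nn.AdaptiveAvgPool2d(pool)  # coarse tokens keep attention cheap
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())

    def forward(self, x):                           # x: (B, C, H, W)
        B, C, H, W = x.shape
        local = self.local(x)
        q = x.flatten(2).transpose(1, 2)            # (B, H*W, C) queries
        kv = self.pool(x).flatten(2).transpose(1, 2)  # (B, pool*pool, C)
        glob, _ = self.attn(q, kv, kv)
        glob = glob.transpose(1, 2).reshape(B, C, H, W)
        g = self.gate(x)                            # adaptive per-pixel mixing
        return g * local + (1 - g) * glob

x = torch.randn(2, 64, 32, 32)
print(LocalGlobalAttention(64)(x).shape)            # torch.Size([2, 64, 32, 32])
```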
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- 4Seasons: Benchmarking Visual SLAM and Long-Term Localization for Autonomous Driving in Challenging Conditions [54.59279160621111]
We present a novel visual SLAM and long-term localization benchmark for autonomous driving in challenging conditions based on the large-scale 4Seasons dataset.
The proposed benchmark provides drastic appearance variations caused by seasonal changes and diverse weather and illumination conditions.
We introduce a new unified benchmark for jointly evaluating visual odometry, global place recognition, and map-based visual localization performance.
arXiv Detail & Related papers (2022-12-31T13:52:36Z)
- iSimLoc: Visual Global Localization for Previously Unseen Environments with Simulated Images [21.43167626240771]
This paper presents iSimLoc, a consistent hierarchical global re-localization approach.
iSimLoc's place features can be used to search for target images under changing appearance and viewpoints.
We evaluate our method on one dataset with appearance variations and one dataset demonstrating large-scale matching over a long flight.
arXiv Detail & Related papers (2022-09-14T02:40:50Z)
- Co-visual pattern augmented generative transformer learning for automobile geo-localization [12.449657263683337]
Cross-view geo-localization (CVGL) aims to estimate the geographical location of the ground-level camera by matching against enormous geo-tagged aerial images.
We present a novel approach using cross-view knowledge generative techniques in combination with transformers, namely mutual generative transformer learning (MGTL), for CVGL.
arXiv Detail & Related papers (2022-03-17T07:29:02Z)
- Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system; a toy scale-recovery sketch follows below.
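The summary skips how disparity yields absolute scale. The standard stereo relation depth = f·B/d is what virtual stereo training makes available; the numpy sketch below uses it to rescale an up-to-scale VO trajectory by a median depth ratio (focal length, baseline, and data are invented for illustration, and this is not VRVO's exact integration scheme):

```python
import numpy as np

def metric_depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Standard stereo relation: depth = f * B / d, in meters."""
    return focal_px * baseline_m / np.maximum(disparity_px, 1e-6)

def rescale_trajectory(vo_positions, vo_depths, disparities,
                       focal_px=720.0, baseline_m=0.54):
    """Toy absolute-scale recovery: scale the up-to-scale VO trajectory
    by the median ratio between metric depth (from a scale-aware
    disparity network) and the VO system's internal depth estimates."""
    metric = metric_depth_from_disparity(disparities, focal_px, baseline_m)
    scale = np.median(metric / np.maximum(vo_depths, 1e-6))
    return np.asarray(vo_positions) * scale, scale

# Example: a VO system that underestimates the world by 2x gets corrected.
true_depths = np.array([5.0, 10.0, 20.0])
disps = 720.0 * 0.54 / true_depths              # consistent synthetic disparities
positions, s = rescale_trajectory([[0, 0, 0], [0, 0, 1.0]],
                                  true_depths / 2.0, disps)
print(s, positions[-1])                          # ~2.0, [0. 0. 2.]
```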
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)