Related papers: DiffVL: Diffusion-Based Visual Localization on 2D Maps via BEV-Conditioned GPS Denoising

URL: http://arxiv.org/abs/2509.14565v1
Date: Thu, 18 Sep 2025 02:57:28 GMT
Title: DiffVL: Diffusion-Based Visual Localization on 2D Maps via BEV-Conditioned GPS Denoising
Authors: Li Gao, Hongyang Sun, Liu Liu, Yunhao Li, Yang Cai
Abstract summary: We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our work proves that DiffVL can enable scalable localization by treating noisy GPS as a generative prior.
Score: 23.54747289630525
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Accurate visual localization is crucial for autonomous driving, yet existing methods face a fundamental dilemma: while high-definition (HD) maps provide high-precision localization references, their costly construction and maintenance hinder scalability, which drives research toward standard-definition (SD) maps like OpenStreetMap. Current SD-map-based approaches primarily focus on Bird's-Eye View (BEV) matching between images and maps, overlooking a ubiquitous signal: noisy GPS. Although GPS is readily available, it suffers from multipath errors in urban environments. We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our key insight is that a noisy GPS trajectory, when conditioned on visual BEV features and SD maps, implicitly encodes the true pose distribution, which can be recovered through iterative diffusion refinement. DiffVL, unlike prior BEV-matching methods (e.g., OrienterNet) or transformer-based registration approaches, learns to reverse GPS noise perturbations by jointly modeling GPS, SD map, and visual signals, achieving sub-meter accuracy without relying on HD maps. Experiments on multiple datasets demonstrate that our method achieves state-of-the-art accuracy compared to BEV-matching baselines. Crucially, our work proves that diffusion models can enable scalable localization by treating noisy GPS as a generative prior, marking a paradigm shift from traditional matching-based methods.
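To make the "GPS denoising" framing in the abstract concrete, here is a minimal sketch of a generic DDPM-style forward perturbation and mean-only reverse refinement applied to a 2D trajectory. It is not DiffVL's implementation, which the abstract does not specify: the schedule values, the function names, and in particular `make_oracle` are assumptions for illustration. The oracle is a stand-in that knows the true positions, so the loop can run without a trained model; in a real system a learned network conditioned on BEV features and the SD map would play that role.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linear variance schedule (the values are illustrative assumptions)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Perturb clean positions x0 (N x 2, local metres) to diffusion step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps


def make_oracle(x0, alpha_bars):
    """Hypothetical noise predictor that cheats by using the true x0.

    Assumption: this replaces the learned, BEV- and map-conditioned
    network, which is not available in this sketch.
    """
    def predict_noise(x, t):
        return (x - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])
    return predict_noise


def reverse_denoise(xt, t_start, predict_noise, betas, alphas, alpha_bars):
    """Iteratively refine xt using the DDPM posterior mean at each step.

    The stochastic term of the sampler is omitted to keep the sketch
    deterministic.
    """
    x = xt.copy()
    for t in range(t_start, -1, -1):
        eps_hat = predict_noise(x, t)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas, alphas, alpha_bars = make_schedule()
    # Toy ground-truth trajectory in a local metric frame (made-up values).
    x0 = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0], [3.0, 1.5]])
    t_start = len(betas) - 1
    noisy, _ = forward_noise(x0, t_start, alpha_bars, rng)
    recovered = reverse_denoise(noisy, t_start, make_oracle(x0, alpha_bars),
                                betas, alphas, alpha_bars)
    print("max abs error after refinement:", np.abs(recovered - x0).max())
```

With the oracle the refinement returns the clean trajectory up to floating-point error, which only checks that the update rule is consistent with the forward process. Accuracy in practice depends entirely on how well the learned predictor uses the visual and map conditioning.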

Related papers

SegLocNet: Multimodal Localization Network for Autonomous Driving via Bird's-Eye-View Segmentation [0.0]
SegLocNet is a multimodal-free localization network that achieves precise localization using semantic segmentation. Our method can accurately estimate the ego pose in urban environments without relying on generalization. Our code and pre-trained model will be released publicly.
arXiv Detail & Related papers (2025-02-27T13:34:55Z)
TopoSD: Topology-Enhanced Lane Segment Perception with SDMap Prior [70.84644266024571]
We propose to train a perception model to "see" standard definition maps (SDMaps). We encode SDMap elements into neural spatial map representations and instance tokens, and then incorporate such complementary features as prior information. Based on the lane segment representation framework, the model simultaneously predicts lanes, centrelines and their topology.
arXiv Detail & Related papers (2024-11-22T06:13:42Z)
MapLocNet: Coarse-to-Fine Feature Registration for Visual Re-Localization in Navigation Maps [8.373285397029884]
Traditional localization approaches rely on high-definition (HD) maps, which consist of precisely annotated landmarks. We propose a novel transformer-based neural re-localization method, inspired by image registration. Our method significantly outperforms the current state-of-the-art OrienterNet on both the nuScenes and Argoverse datasets.
arXiv Detail & Related papers (2024-07-11T14:51:18Z)
Augmenting Lane Perception and Topology Understanding with Standard Definition Navigation Maps [51.24861159115138]
Standard Definition (SD) maps are more affordable and have worldwide coverage, offering a scalable alternative. We propose a novel framework to integrate SD maps into online map prediction and propose a Transformer-based encoder, SD Map Representations from transFormers. This enhancement consistently and significantly boosts (by up to 60%) lane detection and topology prediction on current state-of-the-art online map prediction methods.
arXiv Detail & Related papers (2023-11-07T15:42:22Z)
U-BEV: Height-aware Bird's-Eye-View Segmentation and Neural Map-based Relocalization [81.76044207714637]
Relocalization is essential for intelligent vehicles when GPS reception is insufficient or sensor-based localization fails. Recent advances in Bird's-Eye-View (BEV) segmentation allow for accurate estimation of local scene appearance. This paper presents U-BEV, a U-Net inspired architecture that extends the current state-of-the-art by allowing the BEV to reason about the scene on multiple height layers before flattening the BEV features.
arXiv Detail & Related papers (2023-10-20T18:57:38Z)
EgoVM: Achieving Precise Ego-Localization using Lightweight Vectorized Maps [9.450650025266379]
We present EgoVM, an end-to-end localization network that achieves comparable localization accuracy to prior state-of-the-art methods. We employ a set of learnable semantic embeddings to encode the semantic types of map elements and supervise them with semantic segmentation. We adopt a robust histogram-based pose solver to estimate the optimal pose by searching exhaustively over candidate poses.
arXiv Detail & Related papers (2023-07-18T06:07:25Z)
BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN). We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z)
Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view. Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z)
A Survey on Visual Map Localization Using LiDARs and Cameras [0.0]
We define visual map localization as a two-stage process. At the stage of place recognition, the initial position of the vehicle in the map is determined by comparing the visual sensor output with a set of geo-tagged map regions of interest. At the stage of map metric localization, the vehicle is tracked while it moves across the map by continuously aligning the visual sensors' output with the current area of the map that is being traversed.
arXiv Detail & Related papers (2022-08-05T20:11:18Z)
Semantic Image Alignment for Vehicle Localization [111.59616433224662]
We present a novel approach to vehicle localization in dense semantic maps using semantic segmentation from a monocular camera. In contrast to existing visual localization approaches, the system does not require additional keypoint features, handcrafted localization landmark extractors or expensive LiDAR sensors.
arXiv Detail & Related papers (2021-10-08T14:40:15Z)
Coarse-to-fine Semantic Localization with HD Map for Autonomous Driving in Structural Scenes [1.1024591739346292]
We propose a cost-effective vehicle localization system with HD map for autonomous driving using cameras as primary sensors. We formulate vision-based localization as a data association problem that maps visual semantics to landmarks in HD map. We evaluate our method on two datasets and demonstrate that the proposed approach yields promising localization results in different driving scenarios.
arXiv Detail & Related papers (2021-07-06T11:58:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.