VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization
- URL: http://arxiv.org/abs/2507.11653v1
- Date: Tue, 15 Jul 2025 18:38:35 GMT
- Title: VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization
- Authors: Hannah Shafferman, Annika Thomas, Jouko Kinnari, Michael Ricard, Jose Nino, Jonathan How
- Abstract summary: VISTA is a novel open-set, monocular global localization framework. It exploits geometric consistencies between environment maps to align reference frames. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods.
- Score: 0.2356141385409842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, spatial aliasing, and occlusions -- known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based, segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle reference frames. VISTA enables consistent localization across diverse camera viewpoints and seasonal changes, without requiring any domain-specific training or finetuning. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods. Furthermore, we maintain a compact object-based map that is only 0.6% the size of the most memory-conservative baseline, making our approach capable of real-time implementation on resource-constrained platforms.
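The abstract's second stage, aligning vehicle reference frames from matched object-level submaps, can be illustrated with a minimal sketch. The code below is not the authors' pipeline: it assumes 2-D object centroids with known correspondences (VISTA's correspondence search itself is not reproduced) and recovers the frame alignment with the standard Kabsch algorithm.

```python
import numpy as np

def rigid_align(src, dst):
    """Estimate a rotation R and translation t such that
    dst ~ src @ R.T + t, via the Kabsch algorithm (SVD of the
    cross-covariance of the centered point sets)."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                      # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    D = np.diag([1.0] * (src.shape[1] - 1) + [d])
    R = Vt.T @ D @ U.T
    t = dst.mean(axis=0) - src.mean(axis=0) @ R.T
    return R, t

# Toy example: the same four "objects" observed in two vehicle frames
# that differ by a 30-degree rotation and a translation.
map_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
map_b = map_a @ R_true.T + np.array([5.0, -2.0])

R, t = rigid_align(map_a, map_b)
print(np.allclose(map_a @ R.T + t, map_b))   # True in the noiseless case
```

Because only object centroids are stored, a map like `map_a` stays tiny, which is consistent with the abstract's claim of a compact object-based map suitable for resource-constrained platforms.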
Related papers
- Glance-MCMT: A General MCMT Framework with Glance Initialization and Progressive Association [0.0]
We propose a multi-camera multi-target (MCMT) tracking framework that ensures consistent global identity assignment across views. The pipeline starts with BoT-SORT-based single-camera tracking, followed by an initial glance phase to initialize global IDs. New global IDs are introduced only when no sufficiently similar trajectory or feature match is found.
arXiv Detail & Related papers (2025-07-14T09:57:53Z)
- Environmental Change Detection: Toward a Practical Task of Scene Change Detection [23.79599379113436]
We propose a novel framework that jointly understands spatial environments and detects changes. We address the limitations of standard scene change detection by leveraging multiple reference candidates and aggregating semantically rich representations for change detection. We evaluate our framework on three standard benchmark sets reconstructed for ECD, significantly outperforming a naive combination of state-of-the-art methods.
arXiv Detail & Related papers (2025-06-13T06:09:43Z)
- Hierarchical Image Matching for UAV Absolute Visual Localization via Semantic and Structural Constraints [10.639191465547517]
Absolute localization is crucial for unmanned aerial vehicles (UAVs) in various applications, but it becomes challenging when global navigation satellite system (GNSS) signals are unavailable. Vision-based absolute localization methods, which locate the current view of the UAV in a reference satellite map to estimate its position, have become popular in GNSS-denied scenarios. Existing methods mostly rely on traditional, low-level image matching and struggle with the significant differences introduced by cross-source discrepancies and temporal variations. We introduce a hierarchical cross-source image matching method designed for UAV absolute localization, which integrates semantic-aware and structural constraints.
arXiv Detail & Related papers (2025-06-11T13:53:03Z)
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z)
- ConGeo: Robust Cross-view Geo-localization across Ground View Variations [34.192775134189965]
Cross-view geo-localization aims at localizing a ground-level query image by matching it to its corresponding geo-referenced aerial view.
Existing learning pipelines are orientation-specific or FoV-specific, demanding separate model training for different ground view variations.
We propose ConGeo, a single- and cross-view Contrastive method for Geo-localization.
arXiv Detail & Related papers (2024-03-20T20:37:13Z)
- CurriculumLoc: Enhancing Cross-Domain Geolocalization through Multi-Stage Refinement [11.108860387261508]
Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images taken at some unknown location, to a set of geo-tagged reference images.
We develop CurriculumLoc, a novel keypoint detection and description method with global semantic awareness and local geometric verification.
We achieve new high recall@1 scores of 62.6% and 94.5% on ALTO under two different distance metrics, respectively.
arXiv Detail & Related papers (2023-11-20T08:40:01Z) - Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z) - Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z) - Decoupled Multi-task Learning with Cyclical Self-Regulation for Face
Parsing [71.19528222206088]
We propose DML-CSR, a novel Decoupled Multi-task Learning framework with Cyclical Self-Regulation for face parsing.
Specifically, DML-CSR designs a multi-task model which comprises face parsing, binary edge, and category edge detection.
Our method achieves the new state-of-the-art performance on the Helen, CelebA-HQ, and LapaMask datasets.
arXiv Detail & Related papers (2021-12-14T18:59:11Z)
- Adaptive Affinity for Associations in Multi-Target Multi-Camera Tracking [53.668757725179056]
We propose a simple yet effective approach to adapt affinity estimations to corresponding matching scopes in MTMCT.
Instead of trying to deal with all appearance changes, we tailor the affinity metric to specialize in the changes that might emerge during data association.
By minimizing this mismatch, the adaptive affinity module brings significant improvements over the global re-ID distance.
arXiv Detail & Related papers (2020-12-02T18:19:51Z)
- Cross-Descriptor Visual Localization and Mapping [81.16435356103133]
Visual localization and mapping is the key technology underlying the majority of Mixed Reality and robotics systems.
We present three novel scenarios for localization and mapping which require the continuous update of feature representations.
Our data-driven approach is agnostic to the feature descriptor type, has low computational requirements, and scales linearly with the number of description algorithms.
arXiv Detail & Related papers (2020-09-21T08:12:02Z)
- Learning Spatio-Appearance Memory Network for High-Performance Visual Tracking [79.80401607146987]
Existing object trackers usually learn a bounding-box-based template to match visual targets across frames, which cannot accurately capture a pixel-wise representation.
This paper presents a novel segmentation-based tracking architecture, which is equipped with a spatio-appearance memory network to learn accurate spatio-temporal correspondence.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.