Lifting Vision: Ground to Aerial Localization with Reasoning Guided Planning
- URL: http://arxiv.org/abs/2512.24404v1
- Date: Tue, 30 Dec 2025 18:36:39 GMT
- Title: Lifting Vision: Ground to Aerial Localization with Reasoning Guided Planning
- Authors: Soham Pahari, M. Srinivas
- Abstract summary: We introduce Visual Reasoning for Localization, or ViReLoc, which performs planning and localization using only visual representations. The proposed framework learns spatial dependencies and geometric relations that text-based reasoning often fails to capture. Experiments across diverse navigation and localization scenarios show consistent improvements in spatial reasoning accuracy and cross-view retrieval performance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent developments in multimodal intelligence show strong progress in visual understanding and high-level reasoning. However, most reasoning systems still rely on textual information as the main medium for inference, which limits their effectiveness in spatial tasks such as visual navigation and geo-localization. This work discusses the potential scope of this field and proposes a visual reasoning paradigm, Geo-Consistent Visual Planning, realized in a framework called Visual Reasoning for Localization, or ViReLoc, which performs planning and localization using only visual representations. The proposed framework learns spatial dependencies and geometric relations that text-based reasoning often fails to capture. By encoding step-by-step inference in the visual domain and optimizing with reinforcement-based objectives, ViReLoc plans routes between two given ground images. The system also integrates contrastive learning and adaptive feature interaction to align cross-view perspectives and reduce viewpoint differences. Experiments across diverse navigation and localization scenarios show consistent improvements in spatial reasoning accuracy and cross-view retrieval performance. These results establish visual reasoning as a strong complementary approach for navigation and localization, and show that such tasks can be performed without real-time global positioning system data, leading to more secure navigation solutions.
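The abstract attributes cross-view alignment to contrastive learning but gives no concrete form; as a rough sketch under that reading, the snippet below implements a standard symmetric InfoNCE loss between paired ground and aerial embeddings. The function name, tensor shapes, and temperature value are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: symmetric InfoNCE contrastive loss for cross-view
# (ground <-> aerial) alignment. Function name, shapes, and temperature
# are illustrative assumptions, not ViReLoc's actual implementation.
import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(ground_feats, aerial_feats, temperature=0.07):
    """ground_feats, aerial_feats: (B, D) embeddings of paired views."""
    g = F.normalize(ground_feats, dim=-1)
    a = F.normalize(aerial_feats, dim=-1)
    logits = g @ a.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(g.size(0), device=g.device)
    # Matched ground/aerial pairs lie on the diagonal; pull them together
    # while pushing apart mismatched pairs, in both retrieval directions.
    loss_g2a = F.cross_entropy(logits, targets)
    loss_a2g = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_g2a + loss_a2g)
```

A loss of this shape directly targets the cross-view retrieval setting the abstract evaluates, since it makes the matched aerial view the nearest neighbor of each ground embedding.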
Related papers
- TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding [28.808796664403342]
We propose a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis.
arXiv Detail & Related papers (2026-02-23T12:18:26Z) - RegionReasoner: Region-Grounded Multi-Round Visual Reasoning [69.75509909581133]
RegionReasoner is a reinforcement learning framework for visual reasoning. It enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment.
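The structured reward is named but not specified; a minimal sketch of one plausible shape, blending box IoU as grounding fidelity with a semantic-alignment score, is given below. The 50/50 weighting, helper names, and alignment scorer are assumptions, not the paper's design.

```python
# Hedged sketch of a structured reward combining grounding fidelity
# (IoU against reference boxes) with a semantic-alignment score.
# Weights and helpers are assumptions, not RegionReasoner's design.
def box_iou(a, b):
    """a, b: (x1, y1, x2, y2) boxes; returns intersection-over-union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def structured_reward(cited_boxes, ref_boxes, alignment_score, w_ground=0.5):
    """Average best-match IoU over the boxes a reasoning trace cites,
    blended with a global-local semantic alignment score in [0, 1]."""
    if not cited_boxes or not ref_boxes:
        return 0.0  # a trace that cites nothing earns no grounding reward
    fidelity = sum(max(box_iou(c, r) for r in ref_boxes)
                   for c in cited_boxes) / len(cited_boxes)
    return w_ground * fidelity + (1 - w_ground) * alignment_score
```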
arXiv Detail & Related papers (2026-02-03T16:52:16Z) - SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing [57.609801041296095]
Vision-language models (VLMs) are emerging as powerful tools for remote sensing. We enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism.
arXiv Detail & Related papers (2025-12-09T18:15:43Z) - Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning [5.517595398768408]
We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery.
arXiv Detail & Related papers (2025-12-09T14:25:24Z) - Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
Through inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
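How the agents communicate is not detailed in the summary; the sketch below shows one simple possibility, confidence-weighted voting over candidate locations, with the agent interface entirely assumed.

```python
# Hedged sketch: aggregating location guesses from multiple VLM agents
# by confidence-weighted voting. The agent.localize interface is an
# assumption, not smileGeo's actual communication protocol.
from collections import defaultdict

def consensus_location(agents, image, retrieved_context):
    """Each agent returns (candidate_location, confidence in [0, 1])."""
    votes = defaultdict(float)
    for agent in agents:
        # Agents see the query image plus any retrieved reference info.
        location, confidence = agent.localize(image, retrieved_context)
        votes[location] += confidence
    return max(votes, key=votes.get)  # highest cumulative confidence wins
```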
arXiv Detail & Related papers (2024-08-21T03:31:30Z) - Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation [1.2473780585666772]
Most Vision-and-Language Navigation (VLN) algorithms are prone to making inaccurate decisions due to their lack of visual common sense and limited reasoning capabilities.
We propose a Hierarchical Spatial Proximity Reasoning (HSPR) method to help the agent build a knowledge base of hierarchical spatial proximity.
We validate our approach with experiments on publicly available datasets including REVERIE, SOON, R2R, and R4R.
arXiv Detail & Related papers (2024-03-18T07:51:22Z) - Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z) - Learning Concept-Based Causal Transition and Symbolic Reasoning for Visual Planning [36.131648635051334]
Visual planning simulates how humans make decisions to achieve desired goals.
We propose an interpretable and generalizable visual planning framework.
We show that our framework can generalize to unseen task trajectories, unseen object categories, and real-world data.
arXiv Detail & Related papers (2023-10-05T05:41:21Z) - Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
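The on-the-fly topological map is described only at a high level; the sketch below shows one plausible minimal structure, a graph of visited and frontier nodes whose frontier set forms a global action space. The networkx layout is an assumption, not DUET's implementation.

```python
# Hedged sketch: an on-the-fly topological map as a graph of visited
# and frontier (observed but unvisited) nodes. Data layout is an
# assumption, not DUET's actual implementation.
import networkx as nx

class TopologicalMap:
    def __init__(self):
        self.graph = nx.Graph()

    def update(self, current_node, neighbor_nodes):
        """Mark the current node visited and link observed neighbors."""
        self.graph.add_node(current_node, visited=True)
        for nb in neighbor_nodes:
            if nb not in self.graph:
                self.graph.add_node(nb, visited=False)  # frontier node
            self.graph.add_edge(current_node, nb)

    def global_candidates(self):
        """Global action space: every frontier node reachable so far."""
        return [n for n, d in self.graph.nodes(data=True) if not d["visited"]]
```

Selecting actions over such a graph, rather than only among immediately adjacent viewpoints, is what lets a global planner backtrack efficiently to a distant frontier.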
arXiv Detail & Related papers (2022-02-23T19:06:53Z) - Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a crucial architecture for vision-language navigation (VLN), called Structured Scene Memory (SSM).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z) - Neural Topological SLAM for Visual Navigation [112.73876869904]
We design topological representations for space that leverage semantics and afford approximate geometric reasoning.
We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation.
arXiv Detail & Related papers (2020-05-25T17:56:29Z)