Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision
- URL: http://arxiv.org/abs/2512.10956v1
- Date: Thu, 11 Dec 2025 18:59:56 GMT
- Title: Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision
- Authors: Wentao Zhou, Xuweiyi Chen, Vignesh Rajagopal, Jeffrey Chen, Rohan Chandra, Zezhou Cheng
- Abstract summary: We show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. We find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass it using the full data.
- Score: 13.586199223564273
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of foundation models in language and vision has motivated research into fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, and the depth-scale ambiguity of monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support the training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass it using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
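As a rough illustration of the architecture the abstract describes (not the authors' code), the sketch below fuses features from the RGB image, a stereo-derived metric depth map, and dense point tracks before an action head. All module names, dimensions, and the waypoint action format are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class StereoWalkerSketch(nn.Module):
    """Toy policy: image + stereo depth + point tracks -> waypoint actions."""

    def __init__(self, d_model=256, horizon=8):
        super().__init__()
        # Lightweight encoders for each input stream (illustrative only).
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                     nn.Linear(32 * 8 * 8, d_model))
        self.depth_enc = nn.Sequential(nn.Conv2d(1, 16, 4, 2, 1), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                       nn.Linear(16 * 8 * 8, d_model))
        # Dense pixel tracks: N points x (x, y, dx, dy), pooled over points.
        self.track_enc = nn.Sequential(nn.Linear(4, d_model), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(3 * d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, horizon * 2))  # 2-D waypoints

    def forward(self, rgb, depth, tracks):
        fused = torch.cat([self.rgb_enc(rgb),
                           self.depth_enc(depth),
                           self.track_enc(tracks).mean(dim=1)], dim=-1)
        return self.head(fused).view(rgb.size(0), -1, 2)

# Depth from stereo and tracks from an off-the-shelf tracker would be
# computed upstream; random tensors stand in here.
policy = StereoWalkerSketch()
actions = policy(torch.randn(1, 3, 128, 128),   # left RGB frame
                 torch.randn(1, 1, 128, 128),   # metric depth from stereo
                 torch.randn(1, 64, 4))         # 64 tracked points
print(actions.shape)  # torch.Size([1, 8, 2])
```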
Related papers
- DeFM: Learning Foundation Representations from Depth for Robotics [49.77188649197404]
We present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications. DeFM learns geometric and semantic representations that generalize to diverse environments, tasks, and sensors. It achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments.
arXiv Detail & Related papers (2026-01-26T19:45:31Z)
- VIMD: Monocular Visual-Inertial Motion and Depth Estimation [8.959715109842742]
We develop a monocular visual-inertial motion and depth (VIMD) learning framework to estimate dense metric depth. At its core, VIMD exploits multi-view information to iteratively refine per-pixel scale. Our results show that VIMD achieves exceptional accuracy and robustness, even with extremely sparse supervision of as few as 10-20 metric depth points per image.
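To make the idea of anchoring depth to sparse metric points concrete, here is a generic least-squares scale/shift alignment of a relative depth map to a few metric anchors. It is only a sketch of the general technique; VIMD itself iteratively refines scale per pixel, and all names below are illustrative.

```python
import numpy as np

def align_scale_shift(rel_depth, uv, metric_depth):
    """Fit metric = s * rel + t at sparse pixel locations uv, least squares."""
    rel = rel_depth[uv[:, 1], uv[:, 0]]             # sample predictions at anchors
    A = np.stack([rel, np.ones_like(rel)], axis=1)  # design matrix [rel, 1]
    (s, t), *_ = np.linalg.lstsq(A, metric_depth, rcond=None)
    return s * rel_depth + t                        # metric depth map

# 10-20 sparse metric anchors, matching the abstract's setting.
rng = np.random.default_rng(0)
rel = rng.uniform(0.1, 1.0, size=(240, 320))
uv = rng.integers(0, [320, 240], size=(15, 2))
gt = 4.0 * rel[uv[:, 1], uv[:, 0]] + 0.5            # synthetic ground truth
metric = align_scale_shift(rel, uv, gt)
```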
arXiv Detail & Related papers (2025-09-24T02:50:55Z)
- StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes [14.61785829674974]
Underwater stereo depth estimation provides accurate 3D geometry for robotics tasks such as navigation, inspection, and mapping. Existing approaches face two critical challenges: (i) parameter-efficiently adapting large vision foundation encoders to the underwater domain without extensive labeled data, and (ii) tightly fusing globally coherent but scale-ambiguous monocular priors with locally metric yet photometrically fragile stereo correspondences. We propose StereoAdapter, a parameter-efficient self-supervised framework that integrates a LoRA-adapted monocular foundation encoder with a recurrent stereo refinement module.
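As background for the "LoRA-adapted encoder" ingredient, here is a minimal low-rank adaptation wrapper around a frozen linear layer. The rank, scaling, and layer shapes are illustrative assumptions, not StereoAdapter's actual design.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze foundation weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Wrap one projection of a (stand-in) foundation encoder.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(2, 197, 768))               # ViT-style token batch
```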
arXiv Detail & Related papers (2025-09-19T20:57:03Z)
- Object Concepts Emerge from Motion [24.73461163778215]
We propose a biologically inspired framework for learning object-centric visual representations in an unsupervised manner. Our key insight is that motion boundaries serve as a strong signal for object-level grouping. Our framework is fully label-free and does not rely on camera calibration, making it scalable to large-scale unstructured video data.
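One simple way to realize "motion boundaries as a grouping signal" is to threshold the spatial gradient of an optical flow field. The sketch below is a generic illustration under that assumption, not the paper's pipeline.

```python
import numpy as np

def motion_boundaries(flow, thresh=1.0):
    """Flow field (H, W, 2) -> boolean map where motion changes sharply."""
    du_dy, du_dx = np.gradient(flow[..., 0])
    dv_dy, dv_dx = np.gradient(flow[..., 1])
    grad_mag = np.sqrt(du_dx**2 + du_dy**2 + dv_dx**2 + dv_dy**2)
    return grad_mag > thresh

# Synthetic scene: a block moving right over a static background.
flow = np.zeros((100, 100, 2))
flow[30:70, 30:70, 0] = 5.0
edges = motion_boundaries(flow)
print(edges.sum())  # nonzero only around the moving object's outline
```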
arXiv Detail & Related papers (2025-05-27T18:09:02Z)
- Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model [70.67610495024459]
Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps. Existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation.
arXiv Detail & Related papers (2025-03-30T16:24:22Z)
- MonSter++: Unified Stereo Matching, Multi-view Stereo, and Real-time Stereo with Monodepth Priors [52.39201779505421]
MonSter++ is a foundation model for multi-view depth estimation. It integrates monocular depth priors into the multi-view estimation process and achieves new state-of-the-art results on both stereo matching and multi-view stereo.
arXiv Detail & Related papers (2025-01-15T08:11:24Z)
- Playing to Vision Foundation Model's Strengths in Stereo Matching [13.887661472501618]
This study serves as the first exploration of a viable approach for adapting vision foundation models (VFMs) to stereo matching.
Our ViT adapter, referred to as ViTAS, is constructed upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention.
ViTAStereo outperforms the second-best network, StereoBase, by approximately 7.9% in terms of the percentage of error pixels with a tolerance of 3 pixels.
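To make the "cross-attention" module type concrete, here is a minimal cross-attention block in which tokens from one view attend to the other. The module name, dimensions, and residual layout are assumptions for illustration, not the actual ViTAS modules.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Left-view queries attend to right-view keys/values."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, left_tokens, right_tokens):
        fused, _ = self.attn(left_tokens, right_tokens, right_tokens)
        return self.norm(left_tokens + fused)      # residual + norm

block = CrossViewAttention()
left = torch.randn(1, 196, 256)   # tokens from the left image
right = torch.randn(1, 196, 256)  # tokens from the right image
out = block(left, right)
```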
arXiv Detail & Related papers (2024-04-09T12:34:28Z)
- NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [23.72290930234063]
NaVid is a video-based large vision language model (VLM) for vision-and-language navigation.
NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer.
arXiv Detail & Related papers (2024-02-24T16:39:16Z)
- SGM3D: Stereo Guided Monocular 3D Object Detection [62.11858392862551]
We propose a stereo-guided monocular 3D object detection network, termed SGM3D.
We exploit robust 3D features extracted from stereo images to enhance the features learned from the monocular image.
Our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.
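Using stereo features to enhance monocular features at no extra inference cost is reminiscent of training-time feature distillation: a stereo branch supervises the mono branch, and only the mono branch runs at test time. The sketch below shows that generic pattern as one plausible reading of the abstract, not SGM3D's exact formulation.

```python
import torch
import torch.nn.functional as F

def feature_mimic_loss(mono_feat, stereo_feat):
    """L2 between mono features and (detached) stereo teacher features."""
    return F.mse_loss(mono_feat, stereo_feat.detach())

# Training step with stand-in tensors: stereo branch supervises mono branch.
mono_feat = torch.randn(2, 256, 32, 32, requires_grad=True)
stereo_feat = torch.randn(2, 256, 32, 32)
loss = feature_mimic_loss(mono_feat, stereo_feat)
loss.backward()  # at inference, the stereo branch is dropped entirely
```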
arXiv Detail & Related papers (2021-12-03T13:57:14Z)
- Self-Supervised Depth Completion for Active Stereo [55.79929735390945]
Active stereo systems are widely used in the robotics industry due to their low cost and high-quality depth maps. However, these depth sensors suffer from stereo artefacts and do not provide dense depth estimates.
We present the first self-supervised depth completion method for active stereo systems that predicts accurate dense depth maps.
arXiv Detail & Related papers (2021-10-07T07:33:52Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
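A common way to use a learned dynamics model over a representation space for control is random-shooting planning: roll out sampled action sequences in latent space and keep the one whose predicted state lands closest to a goal. The sketch below illustrates that generic recipe; the network shapes and planner are assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Predicts the next latent state from (state, action)."""

    def __init__(self, state_dim=64, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 128),
                                 nn.ReLU(), nn.Linear(128, state_dim))

    def forward(self, z, a):
        return z + self.net(torch.cat([z, a], dim=-1))  # residual update

def plan(dyn, z0, z_goal, horizon=5, samples=256, action_dim=4):
    """Random shooting: sample action sequences, return the best first action."""
    actions = torch.randn(samples, horizon, action_dim)
    z = z0.expand(samples, -1)
    for t in range(horizon):
        z = dyn(z, actions[:, t])
    cost = ((z - z_goal) ** 2).sum(dim=-1)              # distance to goal latent
    return actions[cost.argmin(), 0]

dyn = LatentDynamics()
best_action = plan(dyn, torch.randn(1, 64), torch.randn(1, 64))
```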
arXiv Detail & Related papers (2021-07-08T17:49:37Z)