CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
- URL: http://arxiv.org/abs/2602.18424v1
- Date: Fri, 20 Feb 2026 18:46:27 GMT
- Title: CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
- Authors: Xia Su, Ruiqi Chen, Benlin Liu, Jingwei Ma, Zonglin Di, Ranjay Krishna, Jon Froehlich,
- Abstract summary: We introduce Capability-Conditioned Navigation (CapNav) to evaluate how well Vision-Language Models (VLMs) can navigate complex indoor spaces. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. We find that current VLMs' navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning about spatial dimensions.
- Score: 34.18391119684099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent's mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent's specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test whether VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLMs' navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning about spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at https://github.com/makeabilitylab/CapNav
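To make the benchmark's premise concrete, here is a minimal sketch of capability-conditioned feasibility checking. It is illustrative only: the class names, fields, agent values, and rules below are assumptions for exposition, not the schema or code of the CapNav release.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AgentProfile:
    """Hypothetical capability spec for one agent (fields are illustrative)."""
    name: str
    width_cm: float          # physical footprint
    can_climb_stairs: bool
    can_open_doors: bool

@dataclass
class Obstacle:
    """Hypothetical obstacle annotation along a candidate route."""
    kind: str                # e.g. "stairs", "narrow_gap", "closed_door"
    gap_width_cm: Optional[float] = None

def route_is_feasible(agent: AgentProfile, route: List[Obstacle]) -> bool:
    """Toy feasibility check: the same route can be passable for one
    agent and blocked for another, depending on its capabilities."""
    for obs in route:
        if obs.kind == "stairs" and not agent.can_climb_stairs:
            return False
        if obs.kind == "closed_door" and not agent.can_open_doors:
            return False
        if (obs.kind == "narrow_gap" and obs.gap_width_cm is not None
                and obs.gap_width_cm < agent.width_cm):
            return False
    return True

# The abstract's example: a quadruped can take the stairs, a sweeping robot cannot.
quadruped = AgentProfile("quadruped", width_cm=40, can_climb_stairs=True, can_open_doors=False)
sweeper = AgentProfile("sweeping_robot", width_cm=35, can_climb_stairs=False, can_open_doors=False)
stairs_route = [Obstacle("stairs")]
assert route_is_feasible(quadruped, stairs_route)
assert not route_is_feasible(sweeper, stairs_route)
```

The toy check mirrors the abstract's example: a route containing stairs is feasible for the quadruped but blocked for the sweeping robot, so a capability-aware navigator must condition its answer on the agent profile, not on the scene alone.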
Related papers
- OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [54.661157616245966]
Open-world navigation requires robots to make decisions in complex everyday environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language. We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models.
arXiv Detail & Related papers (2026-03-05T17:02:22Z) - Think, Remember, Navigate: Zero-Shot Object-Goal Navigation with VLM-Powered Reasoning [8.316876835702525]
Vision-Language Models (VLMs) are set to transform robotic navigation. Our framework outsources high-level planning to a VLM, which leverages its contextual understanding to guide a frontier-based exploration agent. When tested on challenging benchmarks like HM3D, Gibson, and MP3D, this method produces exceptionally direct and logical trajectories.
arXiv Detail & Related papers (2025-11-12T03:38:50Z) - VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning [11.140494493881075]
We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots. Unlike prior methods that rely on a single image-level feature similarity to guide a robot, our method combines pixel-wise vision-language features with curiosity-driven exploration. VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.
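As a rough illustration of the idea described above (scoring exploration targets by a mix of vision-language similarity and a curiosity bonus), here is a hypothetical sketch; the function names, weighting scheme, and numbers are assumptions, not VL-Nav's published method:

```python
import numpy as np

def score_frontiers(vl_similarity: np.ndarray,
                    visit_counts: np.ndarray,
                    curiosity_weight: float = 0.5) -> int:
    """Rank candidate frontier points by a weighted sum of
    (a) vision-language similarity to the goal description and
    (b) a curiosity bonus that decays with how often a region was seen.
    Returns the index of the best frontier."""
    curiosity = 1.0 / (1.0 + visit_counts)        # less-visited -> higher bonus
    scores = vl_similarity + curiosity_weight * curiosity
    return int(np.argmax(scores))

# Three candidate frontiers: the second matches the instruction best,
# but the third scores nearly as well and is almost unexplored.
sim = np.array([0.10, 0.62, 0.58])
visits = np.array([5, 4, 0])
print(score_frontiers(sim, visits))  # -> 2 under these weights
```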
arXiv Detail & Related papers (2025-02-02T21:44:15Z) - Navigation with VLM framework: Towards Going to Any Language [4.368039454973151]
Vision Language Models (VLMs) have demonstrated remarkable capabilities in reasoning over both language and visual data. We introduce Navigation with VLM (NavVLM), a training-free framework that harnesses open-source VLMs to enable robots to navigate effectively.
arXiv Detail & Related papers (2024-09-18T02:29:00Z) - Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs [95.8010627763483]
Mobility VLA is a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment understanding and common sense reasoning power of long-context VLMs.
We show that Mobility VLA achieves high end-to-end success rates on previously unsolved multimodal instructions.
arXiv Detail & Related papers (2024-07-10T15:49:07Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), in which we perform parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [23.72290930234063]
NaVid is a video-based large vision language model (VLM) for vision-and-language navigation.
NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer.
arXiv Detail & Related papers (2024-02-24T16:39:16Z) - ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z) - Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON).
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR, with a success rate (SR) improvement of over 27% relative to the current baseline.
arXiv Detail & Related papers (2023-03-06T20:19:19Z) - Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies have observed a slowdown in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)