VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning
- URL: http://arxiv.org/abs/2502.00931v2
- Date: Mon, 10 Feb 2025 06:05:38 GMT
- Title: VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning
- Authors: Yi Du, Taimeng Fu, Zhuoqun Chen, Bowen Li, Shaoshu Su, Zhipeng Zhao, Chen Wang
- Abstract summary: We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots.
Unlike prior methods that rely on a single image-level feature similarity to guide a robot, our method integrates pixel-wise vision-language features with curiosity-driven exploration.
VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.
- Abstract: Vision-language navigation in unknown environments is crucial for mobile robots. In scenarios such as household assistance and rescue, mobile robots need to understand a human command, such as "find a person wearing black". We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots. Unlike prior methods that rely on a single image-level feature similarity to guide a robot, our method integrates pixel-wise vision-language features with curiosity-driven exploration. This approach enables robust navigation to human-instructed instances across diverse environments. We deploy VL-Nav on a four-wheel mobile robot and evaluate its performance through comprehensive navigation tasks in both indoor and outdoor environments, spanning different scales and semantic complexities. Remarkably, VL-Nav operates at a real-time frequency of 30 Hz with a Jetson Orin NX, highlighting its ability to conduct efficient vision-language navigation. Results show that VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.
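The abstract describes the core decision step as fusing pixel-wise vision-language similarity with curiosity-driven exploration when choosing where to navigate. Below is a minimal sketch of such a goal-scoring step, assuming a per-pixel similarity map (e.g. from an open-vocabulary segmentation model) and a simple visit-count novelty bonus; the function name, weighting, and map conventions are hypothetical and not taken from the paper.

```python
import numpy as np

def score_candidate_goals(sim_map, visit_counts, candidates, alpha=0.7):
    """Rank candidate goal cells by combining pixel-wise vision-language
    similarity with a simple curiosity (novelty) bonus.

    sim_map      : HxW array of per-pixel similarity to the language query
                   (e.g. cosine similarity from an open-vocabulary model).
    visit_counts : HxW array counting how often each cell has been observed.
    candidates   : list of (row, col) frontier cells to consider.
    alpha        : weight trading off semantic relevance vs. curiosity
                   (hypothetical value, not from the paper).
    """
    scores = []
    for r, c in candidates:
        semantic = sim_map[r, c]                      # language-conditioned relevance
        curiosity = 1.0 / (1.0 + visit_counts[r, c])  # prefer rarely-seen regions
        scores.append(alpha * semantic + (1.0 - alpha) * curiosity)
    best = candidates[int(np.argmax(scores))]
    return best, scores
```

In a full system, scoring like this would run once per planning cycle (the paper reports 30 Hz on a Jetson Orin NX) and the selected cell would be handed to a local planner.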
Related papers
- Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach [5.009635912655658]
This paper introduces a novel Hand-drawn Map Navigation (HAM-Nav) architecture.
HAM-Nav integrates a unique Selective Visual Association Prompting approach for topological map-based position estimation.
Experiments were conducted in simulated environments, using both wheeled and legged robots.
arXiv Detail & Related papers (2025-01-31T19:03:33Z) - AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans [2.940962519388297]
We propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN).
AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles.
We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
arXiv Detail & Related papers (2024-11-27T17:36:08Z) - CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction [19.997935470257794]
We present CANVAS, a framework that combines visual and linguistic instructions for commonsense-aware navigation.
Its success is driven by imitation learning, enabling the robot to learn from human navigation behavior.
Our experiments show that CANVAS outperforms the strong rule-based system ROS NavStack across all environments.
arXiv Detail & Related papers (2024-10-02T06:34:45Z) - Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigation-specific visual representation learning method that contrasts the agent's egocentric views with semantic maps (a generic sketch of such a contrastive objective appears after this list).
Ego$2$-Map learning transfers the compact and rich information in a map, such as objects, structure, and transitions, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z) - ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z) - Gesture2Path: Imitation Learning for Gesture-aware Navigation [54.570943577423094]
We present Gesture2Path, a novel social navigation approach that combines image-based imitation learning with model-predictive control.
We deploy our method on real robots and showcase the effectiveness of our approach across four gesture-navigation scenarios.
arXiv Detail & Related papers (2022-09-19T23:05:36Z) - Active Visual Information Gathering for Vision-Language Navigation [115.40768457718325]
Vision-language navigation (VLN) is the task of having an agent carry out navigational instructions inside photo-realistic environments.
One of the key challenges in VLN is how to conduct a robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment.
This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent VLN policy.
arXiv Detail & Related papers (2020-07-15T23:54:20Z) - Robot Perception enables Complex Navigation Behavior via Self-Supervised Learning [23.54696982881734]
We propose an approach to unify successful robot perception systems for active target-driven navigation tasks via reinforcement learning (RL).
Our method temporally incorporates compact motion and visual perception data, directly obtained using self-supervision from a single image sequence.
We demonstrate our approach on two real-world driving datasets, KITTI and Oxford RobotCar, using the new interactive CityLearn framework.
arXiv Detail & Related papers (2020-06-16T07:45:47Z) - APPLD: Adaptive Planner Parameter Learning from Demonstration [48.63930323392909]
We introduce APPLD, Adaptive Planner Parameter Learning from Demonstration, which allows existing navigation systems to be successfully applied to new complex environments.
APPLD is verified on two robots running different navigation systems in different environments.
Experimental results show that APPLD can outperform navigation systems with default and expert-tuned parameters, and even the human demonstrators themselves.
arXiv Detail & Related papers (2020-03-31T21:15:16Z) - Visual Navigation Among Humans with Optimal Control as a Supervisor [72.5188978268463]
We propose an approach that combines learning-based perception with model-based optimal control to navigate among humans.
Our approach is enabled by our novel data-generation tool, HumANav.
We demonstrate that the learned navigation policies can anticipate and react to humans without explicitly predicting future human motion.
arXiv Detail & Related papers (2020-03-20T16:13:47Z)
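As referenced in the Ego$2$-Map entry above, representation learning that contrasts egocentric views with semantic maps can be expressed as an InfoNCE-style objective. The sketch below is a generic version of that idea rather than the paper's implementation; the encoder outputs, batch pairing, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(view_emb, map_emb, temperature=0.07):
    """InfoNCE-style contrastive loss between egocentric-view embeddings
    and their corresponding semantic-map embeddings.

    view_emb, map_emb : (B, D) tensors; row i of each forms a positive pair,
                        all other rows in the batch serve as negatives.
    """
    view_emb = F.normalize(view_emb, dim=-1)
    map_emb = F.normalize(map_emb, dim=-1)
    logits = view_emb @ map_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(view_emb.size(0), device=view_emb.device)
    # Symmetric loss: match views to maps and maps to views.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```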