NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2402.15852v7
- Date: Sun, 30 Jun 2024 11:14:13 GMT
- Title: NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
- Authors: Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, He Wang
- Abstract summary: NaVid is a video-based large vision language model (VLM) for vision-and-language navigation.
NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer.
- Score: 23.72290930234063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-language navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, whether to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavor to showcase the capability of VLMs to achieve state-of-the-art navigation performance without any maps, odometers, or depth inputs. Following human instructions, NaVid requires only an on-the-fly video stream from a monocular RGB camera mounted on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally avoids the problems introduced by odometry noise and the Sim2Real gaps arising from map or depth inputs. Moreover, our video-based approach can effectively encode a robot's historical observations as spatio-temporal context for decision making and instruction following. We train NaVid with 510k navigation samples collected from continuous environments, including action-planning and instruction-reasoning samples, along with 763k large-scale web data samples. Extensive experiments show that NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step not only for navigation agents but also for this research field.
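The abstract describes an inference paradigm rather than a concrete API: the agent keeps only the monocular RGB frames it has seen so far, feeds them together with the instruction to a video-based VLM, and parses the textual reply into a next-step action. The sketch below is a rough Python illustration of such a history-conditioned loop; the `VideoVLM` interface, the `NavAgent` class, the prompt wording, and the four-action vocabulary are hypothetical placeholders for illustration, not NaVid's actual implementation or action space.

```python
from dataclasses import dataclass, field
from typing import List, Protocol
import numpy as np

class VideoVLM(Protocol):
    """Hypothetical interface for a video-based VLM (assumption, not NaVid's API)."""
    def generate(self, frames: List[np.ndarray], prompt: str) -> str:
        """Return a free-form textual reply given RGB frames and a prompt."""
        ...

# A common low-level action set in continuous VLN environments (assumed here).
ACTIONS = ("FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")

@dataclass
class NavAgent:
    model: VideoVLM
    instruction: str
    frames: List[np.ndarray] = field(default_factory=list)  # full RGB history

    def step(self, rgb_frame: np.ndarray) -> str:
        """Append the newest monocular RGB frame and query the VLM for the
        next-step action, using the whole video history as context."""
        self.frames.append(rgb_frame)
        prompt = (
            f"Instruction: {self.instruction}\n"
            "Given the video of what the robot has seen so far, "
            f"choose the next action from {ACTIONS}."
        )
        reply = self.model.generate(self.frames, prompt)
        # Parse the reply conservatively; fall back to STOP if nothing matches.
        for action in ACTIONS:
            if action in reply.upper():
                return action
        return "STOP"
```

Note that the only state in this loop is the accumulated RGB frames: no map, odometry, or depth input appears anywhere, which is the property the abstract credits for narrowing the Sim2Real gap.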
Related papers
- Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology [38.2096731046639]
Recent efforts in UAV vision-language navigation predominantly adopt ground-based VLN settings.
We propose solutions from three perspectives: platform, benchmark, and methodology.
arXiv Detail & Related papers (2024-10-09T17:29:01Z)
- DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z)
- Navigation with VLM framework: Go to Any Language [2.9869976373921916]
Vision Large Language Models (VLMs) have demonstrated remarkable capabilities in reasoning with both language and visual data.
We introduce Navigation with VLM (NavVLM), a framework that harnesses equipment-level VLMs to enable agents to navigate towards any language goal, specific or non-specific, in open scenes.
We evaluate NavVLM in richly detailed environments from the Matterport 3D (MP3D), Habitat Matterport 3D (HM3D), and Gibson datasets within the Habitat simulator.
arXiv Detail & Related papers (2024-09-18T02:29:00Z)
- Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs [95.8010627763483]
Mobility VLA is a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment understanding and commonsense reasoning power of long-context VLMs with a low-level navigation policy based on topological graphs.
We show that Mobility VLA achieves high end-to-end success rates on previously unsolved multimodal instructions.
arXiv Detail & Related papers (2024-07-10T15:49:07Z)
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT).
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
arXiv Detail & Related papers (2024-02-12T18:33:47Z)
- AerialVLN: Vision-and-Language Navigation for UAVs [23.40363176320464]
We propose a new task named AerialVLN, which is UAV-based and towards outdoor environments.
We develop a 3D simulator rendered with near-realistic pictures of 25 city-level scenarios.
We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task.
arXiv Detail & Related papers (2023-08-13T09:55:04Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- Vision-Language Navigation with Random Environmental Mixup [112.94609558723518]
Vision-language Navigation (VLN) tasks require an agent to navigate step-by-step while perceiving the visual observations and comprehending a natural language instruction.
Previous works have proposed various data augmentation methods to reduce data bias.
We propose the Random Environmental Mixup (REM) method, which generates cross-connected house scenes as augmented data by mixing up environments.
arXiv Detail & Related papers (2021-06-15T04:34:26Z)
- Sim-to-Real Transfer for Vision-and-Language Navigation [70.86250473583354]
We study the problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions.
Recent work on the task of Vision-and-Language Navigation (VLN) has achieved significant progress in simulation.
To assess the implications of this work for robotics, we transfer a VLN agent trained in simulation to a physical robot.
arXiv Detail & Related papers (2020-11-07T16:49:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.