NOLO: Navigate Only Look Once
- URL: http://arxiv.org/abs/2408.01384v2
- Date: Sat, 16 Nov 2024 15:47:07 GMT
- Title: NOLO: Navigate Only Look Once
- Authors: Bohan Zhou, Zhongbin Zhang, Jiangxing Wang, Zongqing Lu
- Abstract summary: In this paper, we focus on the video navigation setting, where an in-context navigation policy needs to be learned purely from videos in an offline manner.
We propose Navigate Only Look Once (NOLO), a method for learning a navigation policy that possesses the in-context ability.
We show that our algorithm outperforms baselines by a large margin, which demonstrates the in-context learning ability of the learned policy.
- Score: 29.242548047719787
- Abstract: The in-context learning ability of Transformer models has brought new possibilities to visual navigation. In this paper, we focus on the video navigation setting, where an in-context navigation policy needs to be learned purely from videos in an offline manner, without access to the actual environment. For this setting, we propose Navigate Only Look Once (NOLO), a method for learning a navigation policy that possesses the in-context ability and adapts to new scenes by taking corresponding context videos as input without finetuning or re-training. To enable learning from videos, we first propose a pseudo action labeling procedure using optical flow to recover the action label from egocentric videos. Then, offline reinforcement learning is applied to learn the navigation policy. Through extensive experiments on different scenes both in simulation and the real world, we show that our algorithm outperforms baselines by a large margin, which demonstrates the in-context learning ability of the learned policy. For videos and more information, visit https://sites.google.com/view/nol0.
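The abstract describes a two-stage pipeline: pseudo action labels are first recovered from egocentric videos via optical flow, and the navigation policy is then trained with offline reinforcement learning. Below is a minimal sketch of the labeling step, assuming a discrete action set (move_forward, turn_left, turn_right), a Farneback flow estimator, and a simple mean-horizontal-flow heuristic; these specifics are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of pseudo action labeling from optical flow.
# The action set, thresholds, and flow-to-action heuristic are assumptions.
import cv2
import numpy as np

def pseudo_action_label(prev_frame: np.ndarray, next_frame: np.ndarray,
                        turn_threshold: float = 1.0) -> str:
    """Assign a pseudo action to the transition prev_frame -> next_frame."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Dense optical flow of shape (H, W, 2): per-pixel displacement between frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mean_dx = float(flow[..., 0].mean())  # mean horizontal flow ~ camera yaw
    # A uniform horizontal shift suggests a turn: when the camera yaws left,
    # the scene appears to shift right (positive x flow), and vice versa.
    if mean_dx > turn_threshold:
        return "turn_left"
    if mean_dx < -turn_threshold:
        return "turn_right"
    return "move_forward"

# Usage: convert an egocentric video into (observation, pseudo_action) pairs,
# which an offline RL algorithm can then consume as a training dataset.
# frames = [cv2.imread(p) for p in sorted(frame_paths)]
# dataset = [(a, pseudo_action_label(a, b)) for a, b in zip(frames, frames[1:])]
```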
Related papers
- VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation [59.3649071376364]
The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data.
We propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos in multiple cities in the U.S. augmented with automatically generated navigation instructions.
VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate.
arXiv Detail & Related papers (2024-02-05T22:20:19Z)
- Vision-Language Models Provide Promptable Representations for Reinforcement Learning [67.40524195671479]
We propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied reinforcement learning (RL).
We show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.
arXiv Detail & Related papers (2024-02-05T00:48:56Z)
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- Learning Vision-and-Language Navigation from YouTube Videos [89.1919348607439]
Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions.
There are massive house tour videos on YouTube, providing abundant real navigation experiences and layout information.
We create a large-scale dataset comprising reasonable path-instruction pairs from house tour videos and pre-train the agent on it.
arXiv Detail & Related papers (2023-07-22T05:26:50Z)
- ViNG: Learning Open-World Navigation with Visual Goals [82.84193221280216]
We propose a learning-based navigation system for reaching visually indicated goals.
We show that our system, which we call ViNG, outperforms previously-proposed methods for goal-conditioned reinforcement learning.
We demonstrate ViNG on a number of real-world applications, such as last-mile delivery and warehouse inspection.
arXiv Detail & Related papers (2020-12-17T18:22:32Z)
- Unsupervised Domain Adaptation for Visual Navigation [115.85181329193092]
We propose an unsupervised domain adaptation method for visual navigation.
Our method translates the images in the target domain to the source domain such that the translation is consistent with the representations learned by the navigation policy.
arXiv Detail & Related papers (2020-10-27T18:22:43Z)
- Semantic Visual Navigation by Watching YouTube Videos [17.76847333440422]
This paper learns and leverages semantic cues for navigating to objects of interest in novel environments by simply watching YouTube videos.
We show that such off-policy Q-learning from passive data is able to learn meaningful semantic cues for navigation.
We observe a relative improvement of 15-83% over end-to-end RL, behavior cloning, and classical methods, while using minimal direct interaction.
arXiv Detail & Related papers (2020-06-17T17:56:00Z)