Learning Vision-and-Language Navigation from YouTube Videos
- URL: http://arxiv.org/abs/2307.11984v1
- Date: Sat, 22 Jul 2023 05:26:50 GMT
- Title: Learning Vision-and-Language Navigation from YouTube Videos
- Authors: Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H. Li, Mingkui Tan,
Chuang Gan
- Abstract summary: Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions.
There are massive house tour videos on YouTube, providing abundant real navigation experiences and layout information.
We create a large-scale dataset comprising reasonable path-instruction pairs from house tour videos and pre-train the agent on it.
- Score: 89.1919348607439
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language navigation (VLN) requires an embodied agent to navigate
in realistic 3D environments using natural language instructions. Existing VLN
methods suffer from training on small-scale environments or unreasonable
path-instruction datasets, limiting the generalization to unseen environments.
There are massive house tour videos on YouTube, providing abundant real
navigation experiences and layout information. However, these videos have not
been explored for VLN before. In this paper, we propose to learn an agent from
these videos by creating a large-scale dataset which comprises reasonable
path-instruction pairs from house tour videos and pre-training the agent on it.
To achieve this, we have to tackle the challenges of automatically constructing
path-instruction pairs and exploiting real layout knowledge from raw and
unlabeled videos. To address these, we first leverage an entropy-based method
to construct the nodes of a path trajectory. Then, we propose an action-aware
generator for generating instructions from unlabeled trajectories. Last, we
devise a trajectory judgment pretext task to encourage the agent to mine the
layout knowledge. Experimental results show that our method achieves
state-of-the-art performance on two popular benchmarks (R2R and REVERIE). Code
is available at https://github.com/JeremyLinky/YouTube-VLN
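Two ideas in the abstract are easy to misread without a concrete picture: selecting trajectory nodes by the entropy of per-frame room predictions, and a pretext task in which the agent judges whether a trajectory is layout-consistent. The sketch below shows one plausible reading of both steps; it assumes an off-the-shelf room-type classifier producing per-frame softmax probabilities, and the function names, thresholds, and negative-sampling strategy are illustrative rather than the authors' implementation (see the linked repository for that).

    import random
    import numpy as np

    def frame_entropy(room_probs: np.ndarray) -> np.ndarray:
        """Shannon entropy of per-frame room-type distributions.

        room_probs: (num_frames, num_room_types) softmax outputs from an
        assumed off-the-shelf room classifier.
        """
        eps = 1e-12
        return -np.sum(room_probs * np.log(room_probs + eps), axis=1)

    def select_trajectory_nodes(room_probs: np.ndarray,
                                entropy_threshold: float = 1.0) -> list:
        """Keep frames whose room prediction is confident (low entropy) and
        start a new node whenever the predicted room changes, so each node
        roughly marks entering a new room along the tour."""
        entropies = frame_entropy(room_probs)
        rooms = room_probs.argmax(axis=1)
        nodes, prev_room = [], None
        for idx, (h, room) in enumerate(zip(entropies, rooms)):
            if h > entropy_threshold:
                continue  # ambiguous frame, not a reliable node
            if room != prev_room:
                nodes.append(idx)
                prev_room = room
        return nodes

    def trajectory_judgment_example(node_frames, corrupt_prob=0.5):
        """Build one (trajectory, label) pair for a layout-judgment pretext
        task: label 1 for the original node order, 0 for a corrupted order
        with shuffled intermediate nodes (illustrative negative sampling)."""
        if random.random() < corrupt_prob and len(node_frames) > 3:
            middle = list(node_frames[1:-1])
            random.shuffle(middle)
            return [node_frames[0], *middle, node_frames[-1]], 0
        return list(node_frames), 1

Instructions for each node sequence would then come from the action-aware generator described in the abstract, which is not sketched here.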
Related papers
- NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [23.72290930234063]
NaVid is a video-based large vision language model (VLM) for vision-and-language navigation.
NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer.
arXiv Detail & Related papers (2024-02-24T16:39:16Z)
- VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation [59.3649071376364]
The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data.
We propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos in multiple cities in the U.S. augmented with automatically generated navigation instructions.
VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate.
arXiv Detail & Related papers (2024-02-05T22:20:19Z)
- Detours for Navigating Instructional Videos [58.1645668396789]
We propose VidDetours, a video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's.
We show our model's significant improvements over best available methods for video retrieval and question answering, with recall rates exceeding the state of the art by 35%.
arXiv Detail & Related papers (2024-01-03T16:38:56Z)
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- ESceme: Vision-and-Language Navigation with Episodic Scene Memory [72.69189330588539]
Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes.
We introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent's memories of past visits when it enters the current scene.
arXiv Detail & Related papers (2023-03-02T07:42:07Z)
- A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning [70.14372215250535]
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments.
Given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding.
We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory.
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets.
arXiv Detail & Related papers (2022-10-06T17:59:08Z)
- Sub-Instruction Aware Vision-and-Language Navigation [46.99329933894108]
Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.
We focus on the granularity of the visual and language sequences as well as the traceability of agents through the completion of an instruction.
We propose effective sub-instruction attention and shifting modules that select and attend to a single sub-instruction at each time-step.
arXiv Detail & Related papers (2020-04-06T14:44:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.