VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language
Navigation
- URL: http://arxiv.org/abs/2402.03561v2
- Date: Wed, 7 Feb 2024 18:02:51 GMT
- Title: VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language
Navigation
- Authors: Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme, Mohit Bansal
- Abstract summary: The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data.
We propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos from multiple U.S. cities, augmented with automatically generated navigation instructions.
VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate.
- Score: 59.3649071376364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate
through realistic 3D outdoor environments based on natural language
instructions. The performance of existing VLN methods is limited by
insufficient diversity in navigation environments and limited training data. To
address these issues, we propose VLN-Video, which utilizes the diverse outdoor
environments present in driving videos from multiple U.S. cities, augmented
with automatically generated navigation instructions and actions, to improve
outdoor VLN performance. VLN-Video combines the best of intuitive classical
approaches and modern deep learning techniques, using template infilling to
generate grounded navigation instructions, combined with an image rotation
similarity-based navigation action predictor to obtain VLN-style data from
driving videos for pretraining deep learning VLN models. We pre-train the model
on the Touchdown dataset and our video-augmented dataset created from driving
videos with three proxy tasks: Masked Language Modeling, Instruction and
Trajectory Matching, and Next Action Prediction, so as to learn
temporally-aware and visually-aligned instruction representations. The learned
instruction representation is adapted to the state-of-the-art navigator when
fine-tuning on the Touchdown dataset. Empirical results demonstrate that
VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in
task completion rate, achieving a new state-of-the-art on the Touchdown
dataset.
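
As a rough illustration of the template-infilling step described in the abstract, the sketch below fills hypothetical Touchdown-style templates with landmark labels assumed to come from an off-the-shelf object detector run on the video frames; the paper's actual templates and grounding procedure are not reproduced here.

```python
"""Minimal sketch of template-infilling instruction generation.
The templates, slot names, and the use of detector labels as landmarks
are illustrative assumptions, not details taken from the paper."""

# Hypothetical instruction templates in the style of Touchdown navigation text.
TEMPLATES = {
    "forward": "Go straight until you pass the {landmark}.",
    "turn_left": "Turn left at the intersection when you see the {landmark}.",
    "turn_right": "Turn right at the intersection when you see the {landmark}.",
}

def generate_instruction(actions, landmarks_per_step):
    """Fill one template per navigation step, grounding each sentence in a
    landmark detected in the corresponding video frame."""
    sentences = []
    for action, landmarks in zip(actions, landmarks_per_step):
        landmark = landmarks[0] if landmarks else "next intersection"
        sentences.append(TEMPLATES[action].format(landmark=landmark))
    return " ".join(sentences)

# Example: a two-step trajectory with one detected landmark per frame.
print(generate_instruction(
    ["forward", "turn_right"],
    [["blue awning"], ["fire hydrant"]],
))
```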
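The abstract also names an image rotation similarity-based navigation action predictor. The following is a minimal sketch of that idea under assumptions of our own: frames are equirectangular panoramas, similarity is plain normalized cross-correlation, and the candidate rotations and turn threshold are arbitrary choices; the paper's exact similarity measure and action discretization are not specified here.

```python
"""Sketch of an image-rotation-similarity action predictor (assumptions noted above)."""
import numpy as np

def rotate_panorama(pano: np.ndarray, degrees: float) -> np.ndarray:
    # Horizontally shift an equirectangular panorama (H x W x 3) to simulate
    # rotating the camera heading by `degrees`.
    shift = int(round(degrees / 360.0 * pano.shape[1]))
    return np.roll(pano, shift, axis=1)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Normalized cross-correlation between two images of equal shape.
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def predict_action(frame_t: np.ndarray, frame_t1: np.ndarray,
                   candidate_degrees=(-90, -45, 0, 45, 90),
                   turn_threshold: float = 30.0) -> str:
    # Pick the heading rotation of frame_t that best matches frame_t+1;
    # a large best rotation is read as a turn, a small one as moving forward.
    best = max(candidate_degrees,
               key=lambda d: similarity(rotate_panorama(frame_t, d), frame_t1))
    if best <= -turn_threshold:
        return "turn_left"
    if best >= turn_threshold:
        return "turn_right"
    return "forward"
```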
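Finally, the three pretraining proxy tasks (Masked Language Modeling, Instruction and Trajectory Matching, Next Action Prediction) amount to classification heads over the outputs of a cross-modal encoder. The sketch below, written against a generic PyTorch encoder output, shows one plausible way to combine the three losses; the hidden size, action vocabulary, and equal loss weighting are assumptions, not details from the paper.

```python
"""Sketch of the three proxy-task heads over a generic transformer encoder."""
import torch
import torch.nn as nn

class VLNPretrainHeads(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_actions: int = 4):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # Masked Language Modeling
        self.itm_head = nn.Linear(hidden_size, 2)            # Instruction-Trajectory Matching
        self.nap_head = nn.Linear(hidden_size, num_actions)  # Next Action Prediction
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, token_states, pooled_state, mlm_labels, itm_labels, nap_labels):
        # token_states:  (B, L, H) contextualized instruction token states
        # pooled_state:  (B, H)    pooled state over the fused instruction/trajectory input
        loss_mlm = self.ce(self.mlm_head(token_states).transpose(1, 2), mlm_labels)
        loss_itm = self.ce(self.itm_head(pooled_state), itm_labels)
        loss_nap = self.ce(self.nap_head(pooled_state), nap_labels)
        # Equal weighting of the three objectives is an assumption.
        return loss_mlm + loss_itm + loss_nap
```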
Related papers
- Vision-and-Language Navigation Generative Pretrained Transformer [0.0]
Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT) adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for history encoding modules.
Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
arXiv Detail & Related papers (2024-05-27T09:42:04Z)
- NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [23.72290930234063]
NaVid is a video-based large vision language model (VLM) for vision-and-language navigation.
NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer.
arXiv Detail & Related papers (2024-02-24T16:39:16Z)
- Learning Vision-and-Language Navigation from YouTube Videos [89.1919348607439]
Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions.
There is a massive number of house tour videos on YouTube, providing abundant real navigation experiences and layout information.
We create a large-scale dataset comprising reasonable path-instruction pairs from house tour videos and pre-train the agent on it.
arXiv Detail & Related papers (2023-07-22T05:26:50Z)
- ViNT: A Foundation Model for Visual Navigation [52.2571739391896]
Visual Navigation Transformer (ViNT) is a foundation model for vision-based robotic navigation.
ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset.
It exhibits positive transfer, outperforming specialist models trained on singular datasets.
arXiv Detail & Related papers (2023-06-26T16:57:03Z)
- Learning from Unlabeled 3D Environments for Vision-and-Language Navigation [87.03299519917019]
In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions.
We propose to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D.
We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of resulting VLN models.
arXiv Detail & Related papers (2022-08-24T21:50:20Z)
- Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)
- Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation [71.67507925788577]
This paper introduces a Multimodal Text Style Transfer (MTST) learning approach for outdoor navigation tasks.
We first enrich the navigation data by transferring the style of the instructions generated by the Google Maps API, then pre-train the navigator with the augmented external navigation dataset.
Experimental results show that our MTST learning approach is model-agnostic, and our MTST approach significantly outperforms the baseline models on the outdoor VLN task.
arXiv Detail & Related papers (2020-07-01T04:29:07Z)