VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language
Navigation
- URL: http://arxiv.org/abs/2402.03561v2
- Date: Wed, 7 Feb 2024 18:02:51 GMT
- Title: VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language
Navigation
- Authors: Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme, Mohit Bansal
- Abstract summary: The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data.
We propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos from multiple U.S. cities, augmented with automatically generated navigation instructions.
VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate.
- Score: 59.3649071376364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate
through realistic 3D outdoor environments based on natural language
instructions. The performance of existing VLN methods is limited by
insufficient diversity in navigation environments and limited training data. To
address these issues, we propose VLN-Video, which utilizes the diverse outdoor
environments present in driving videos from multiple U.S. cities, augmented
with automatically generated navigation instructions and actions, to improve
outdoor VLN performance. VLN-Video combines the best of intuitive classical
approaches and modern deep learning techniques, using template infilling to
generate grounded navigation instructions, combined with an image rotation
similarity-based navigation action predictor to obtain VLN-style data from
driving videos for pretraining deep learning VLN models. We pre-train the model
on the Touchdown dataset and our video-augmented dataset created from driving
videos with three proxy tasks: Masked Language Modeling, Instruction and
Trajectory Matching, and Next Action Prediction, so as to learn
temporally-aware and visually-aligned instruction representations. The learned
instruction representation is adapted to the state-of-the-art navigator when
fine-tuning on the Touchdown dataset. Empirical results demonstrate that
VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in
task completion rate, achieving a new state-of-the-art on the Touchdown
dataset.
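
As a rough illustration of the template-infilling step described in the abstract, the sketch below fills hypothetical Touchdown-style templates with landmark labels assumed to come from an off-the-shelf object detector run on the video frames; the paper's actual templates and grounding procedure are not reproduced here.

```python
"""Minimal sketch of template-infilling instruction generation.
The templates, slot names, and the use of detector labels as landmarks
are illustrative assumptions, not details taken from the paper."""

# Hypothetical instruction templates in the style of Touchdown navigation text.
TEMPLATES = {
    "forward": "Go straight until you pass the {landmark}.",
    "turn_left": "Turn left at the intersection when you see the {landmark}.",
    "turn_right": "Turn right at the intersection when you see the {landmark}.",
}

def generate_instruction(actions, landmarks_per_step):
    """Fill one template per navigation step, grounding each sentence in a
    landmark detected in the corresponding video frame."""
    sentences = []
    for action, landmarks in zip(actions, landmarks_per_step):
        landmark = landmarks[0] if landmarks else "next intersection"
        sentences.append(TEMPLATES[action].format(landmark=landmark))
    return " ".join(sentences)

# Example: a two-step trajectory with one detected landmark per frame.
print(generate_instruction(
    ["forward", "turn_right"],
    [["blue awning"], ["fire hydrant"]],
))
```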
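The abstract also names an image rotation similarity-based navigation action predictor. The following is a minimal sketch of that idea under assumptions of our own: frames are equirectangular panoramas, similarity is plain normalized cross-correlation, and the candidate rotations and turn threshold are arbitrary choices; the paper's exact similarity measure and action discretization are not specified here.

```python
"""Sketch of an image-rotation-similarity action predictor (assumptions noted above)."""
import numpy as np

def rotate_panorama(pano: np.ndarray, degrees: float) -> np.ndarray:
    # Horizontally shift an equirectangular panorama (H x W x 3) to simulate
    # rotating the camera heading by `degrees`.
    shift = int(round(degrees / 360.0 * pano.shape[1]))
    return np.roll(pano, shift, axis=1)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Normalized cross-correlation between two images of equal shape.
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def predict_action(frame_t: np.ndarray, frame_t1: np.ndarray,
                   candidate_degrees=(-90, -45, 0, 45, 90),
                   turn_threshold: float = 30.0) -> str:
    # Pick the heading rotation of frame_t that best matches frame_t+1;
    # a large best rotation is read as a turn, a small one as moving forward.
    best = max(candidate_degrees,
               key=lambda d: similarity(rotate_panorama(frame_t, d), frame_t1))
    if best <= -turn_threshold:
        return "turn_left"
    if best >= turn_threshold:
        return "turn_right"
    return "forward"
```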
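Finally, the three pretraining proxy tasks (Masked Language Modeling, Instruction and Trajectory Matching, Next Action Prediction) amount to classification heads over the outputs of a cross-modal encoder. The sketch below, written against a generic PyTorch encoder output, shows one plausible way to combine the three losses; the hidden size, action vocabulary, and equal loss weighting are assumptions, not details from the paper.

```python
"""Sketch of the three proxy-task heads over a generic transformer encoder."""
import torch
import torch.nn as nn

class VLNPretrainHeads(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_actions: int = 4):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # Masked Language Modeling
        self.itm_head = nn.Linear(hidden_size, 2)            # Instruction-Trajectory Matching
        self.nap_head = nn.Linear(hidden_size, num_actions)  # Next Action Prediction
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, token_states, pooled_state, mlm_labels, itm_labels, nap_labels):
        # token_states:  (B, L, H) contextualized instruction token states
        # pooled_state:  (B, H)    pooled state over the fused instruction/trajectory input
        loss_mlm = self.ce(self.mlm_head(token_states).transpose(1, 2), mlm_labels)
        loss_itm = self.ce(self.itm_head(pooled_state), itm_labels)
        loss_nap = self.ce(self.nap_head(pooled_state), nap_labels)
        # Equal weighting of the three objectives is an assumption.
        return loss_mlm + loss_itm + loss_nap
```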
Related papers
- Vision-and-Language Navigation Generative Pretrained Transformer [0.0]
Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT) adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for history encoding modules.
Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
arXiv Detail & Related papers (2024-05-27T09:42:04Z)
- NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [23.72290930234063]
NaVid is a video-based large vision language model (VLM) for vision-and-language navigation.
NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer.
arXiv Detail & Related papers (2024-02-24T16:39:16Z)
- Learning Vision-and-Language Navigation from YouTube Videos [89.1919348607439]
Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions.
There is a massive number of house tour videos on YouTube, providing abundant real navigation experiences and layout information.
We create a large-scale dataset comprising reasonable path-instruction pairs from house tour videos and pre-train the agent on it.
arXiv Detail & Related papers (2023-07-22T05:26:50Z)
- ViNT: A Foundation Model for Visual Navigation [52.2571739391896]
Visual Navigation Transformer (ViNT) is a foundation model for vision-based robotic navigation.
ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset.
It exhibits positive transfer, outperforming specialist models trained on singular datasets.
arXiv Detail & Related papers (2023-06-26T16:57:03Z)
- Learning from Unlabeled 3D Environments for Vision-and-Language Navigation [87.03299519917019]
In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions.
We propose to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D.
We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of resulting VLN models.
arXiv Detail & Related papers (2022-08-24T21:50:20Z)
- Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)
- Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation [71.67507925788577]
This paper introduces a Multimodal Text Style Transfer (MTST) learning approach for outdoor navigation tasks.
We first enrich the navigation data by transferring the style of the instructions generated by the Google Maps API, then pre-train the navigator with the augmented external navigation dataset.
Experimental results show that our MTST learning approach is model-agnostic, and our MTST approach significantly outperforms the baseline models on the outdoor VLN task.
arXiv Detail & Related papers (2020-07-01T04:29:07Z)