Improving Vision-and-Language Navigation by Generating Future-View Image
Semantics
- URL: http://arxiv.org/abs/2304.04907v1
- Date: Tue, 11 Apr 2023 00:36:02 GMT
- Title: Improving Vision-and-Language Navigation by Generating Future-View Image
Semantics
- Authors: Jialu Li, Mohit Bansal
- Abstract summary: Vision-and-Language Navigation (VLN) is a task that requires an agent to navigate through an environment based on natural language instructions.
We propose three proxy tasks during the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG).
We then fine-tune the agent on the VLN task with an auxiliary loss that minimizes the difference between the view semantics generated by the agent and the ground truth view semantics of the next step.
- Score: 96.8435716885159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-Language Navigation (VLN) is a task that requires an agent to
navigate through an environment based on natural language instructions. At
each step, the agent takes the next action by selecting from a set of navigable
locations. In this paper, we aim to take one step further and explore whether
the agent can benefit from generating the potential future view during
navigation. Intuitively, humans form an expectation of what the future
environment will look like based on the natural language instructions and
surrounding views, and this expectation aids correct navigation. Hence, to equip the agent
with this ability to generate the semantics of future navigation views, we
first propose three proxy tasks during the agent's in-domain pre-training:
Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action
Prediction with Image Generation (APIG). These three objectives teach the model
to predict missing views in a panorama (MPM), predict missing steps in the full
trajectory (MTM), and generate the next view based on the full instruction and
navigation history (APIG), respectively. We then fine-tune the agent on the VLN
task with an auxiliary loss that minimizes the difference between the view
semantics generated by the agent and the ground truth view semantics of the
next step. Empirically, our VLN-SIG achieves the new state-of-the-art on both
the Room-to-Room dataset and the CVDN dataset. We further show that our agent
learns to fill in missing patches in future views qualitatively, which brings
more interpretability to the agent's predicted actions. Lastly, we demonstrate
that learning to predict future view semantics also enables the agent to have
better performance on longer paths.
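To make the auxiliary fine-tuning objective concrete, below is a minimal sketch of how such a loss could be implemented, assuming the view semantics are represented as discrete visual-token ids per image patch (for example, from a pre-trained image tokenizer). The class and function names here are hypothetical illustrations, not taken from the VLN-SIG codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FutureViewSemanticsHead(nn.Module):
    """Hypothetical head: maps the agent's hidden state at the current step
    to per-patch distributions over discrete visual tokens of the next view."""

    def __init__(self, hidden_dim: int, num_patches: int, vocab_size: int):
        super().__init__()
        self.num_patches = num_patches
        self.vocab_size = vocab_size
        self.proj = nn.Linear(hidden_dim, num_patches * vocab_size)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim)
        # returns logits of shape (batch, num_patches, vocab_size)
        logits = self.proj(hidden_state)
        return logits.view(-1, self.num_patches, self.vocab_size)


def auxiliary_view_semantics_loss(head: FutureViewSemanticsHead,
                                  hidden_state: torch.Tensor,
                                  gt_view_tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the view semantics generated by the agent and
    the ground-truth semantics of the next step's view.

    gt_view_tokens: (batch, num_patches) token ids obtained offline by
    running an image tokenizer on the ground-truth next view.
    """
    logits = head(hidden_state)  # (batch, num_patches, vocab_size)
    return F.cross_entropy(logits.reshape(-1, head.vocab_size),
                           gt_view_tokens.reshape(-1))


# During fine-tuning, this term would be weighted and added to the standard
# action-selection loss, e.g.:
# total_loss = action_loss + lambda_aux * auxiliary_view_semantics_loss(...)
```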
Related papers
- Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation [41.38630220744729]
Vision-and-language navigation (VLN) enables the agent to navigate to a remote location following the natural language instruction in 3D environments.
For better navigation planning, the lookahead exploration strategy aims to effectively evaluate the agent's next action by accurately anticipating the future environment of candidate locations.
arXiv Detail & Related papers (2024-04-02T13:36:03Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions (a rough sketch of this conversion appears after this list).
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [17.495162643127003]
We introduce NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status.
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
arXiv Detail & Related papers (2023-05-26T14:41:06Z)
- Masked Path Modeling for Vision-and-Language Navigation [41.7517631477082]
Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions.
Previous approaches have attempted to address this issue by introducing additional supervision during training.
We introduce a masked path modeling (MPM) objective, which pretrains an agent using self-collected data for downstream navigation tasks.
arXiv Detail & Related papers (2023-05-23T17:20:20Z)
- HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation [33.38079488853708]
Previous pre-training methods for Vision-and-Language Navigation (VLN) either lack the ability to predict future actions or ignore contexts.
We propose a novel pre-training paradigm that exploits past observations and supports future action prediction.
Our navigation action prediction is also enhanced by the task of Action Prediction with History.
arXiv Detail & Related papers (2022-03-22T10:17:12Z)
- History Aware Multimodal Transformer for Vision-and-Language Navigation [96.80655332881432]
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
We introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making.
arXiv Detail & Related papers (2021-10-25T22:54:41Z)
- Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information [83.62098382773266]
Vision-language navigation is a task that requires an agent to navigate through a 3D environment based on natural language instructions.
We propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes.
Our agent achieves the new state-of-the-art on the Room-Across-Room dataset, which contains instructions in 3 languages.
arXiv Detail & Related papers (2021-04-19T19:18:41Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)
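As a rough illustration of the LangNav-style conversion referenced above (turning an egocentric panoramic view into text with off-the-shelf captioning and detection models), here is a minimal sketch; `describe_panorama`, `caption`, and `detect` are hypothetical placeholders rather than components of that paper.

```python
from typing import Any, Callable, List, Sequence


def describe_panorama(view_images: Sequence[Any],
                      caption: Callable[[Any], str],
                      detect: Callable[[Any], List[str]]) -> str:
    """Convert an egocentric panoramic view (one image per discretized
    heading) into a single natural-language description that a
    language-based navigation policy could consume."""
    parts = []
    for heading_idx, img in enumerate(view_images):
        # Off-the-shelf models supply a caption and a list of object names.
        objects = ", ".join(detect(img)) or "nothing notable"
        parts.append(f"Heading {heading_idx}: {caption(img)} "
                     f"(visible objects: {objects}).")
    return " ".join(parts)
```

The resulting description would then be fed, together with the instruction and the navigation history, to a language-model-based policy in place of raw visual features.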