Masked Path Modeling for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2305.14268v1
- Date: Tue, 23 May 2023 17:20:20 GMT
- Title: Masked Path Modeling for Vision-and-Language Navigation
- Authors: Zi-Yi Dou, Feng Gao, Nanyun Peng
- Abstract summary: Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions.
Previous approaches have attempted to address the limited availability of training data by introducing additional supervision during training.
We introduce a masked path modeling (MPM) objective, which pretrains an agent using self-collected data for downstream navigation tasks.
- Score: 41.7517631477082
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-language navigation (VLN) agents are trained to navigate in
real-world environments by following natural language instructions. A major
challenge in VLN is the limited availability of training data, which hinders
the models' ability to generalize effectively. Previous approaches have
attempted to address this issue by introducing additional supervision during
training, often requiring costly human-annotated data that restricts
scalability. In this paper, we introduce a masked path modeling (MPM)
objective, which pretrains an agent using self-collected data for downstream
navigation tasks. Our proposed method involves allowing the agent to actively
explore navigation environments without a specific goal and collect the paths
it traverses. Subsequently, we train the agent on this collected data to
reconstruct the original path given a randomly masked subpath. This way, the
agent can actively accumulate a diverse and substantial amount of data while
learning conditional action generation. To evaluate the effectiveness of our
technique, we conduct experiments on various VLN datasets and demonstrate the
versatility of MPM across different levels of instruction complexity. Our
results exhibit significant improvements in success rates, with enhancements of
1.32%, 1.05%, and 1.19% on the val-unseen split of the Room-to-Room,
Room-for-Room, and Room-across-Room datasets, respectively. Furthermore, we
conduct an analysis that highlights the potential for additional improvements
when the agent is allowed to explore unseen environments prior to testing.
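As a rough illustration of the MPM objective, the minimal sketch below masks a random contiguous subpath of a self-collected action sequence and trains a toy sequence model to reconstruct the original actions. Everything concrete here is an assumption rather than the paper's implementation: the Transformer stand-in, the discrete action vocabulary, the MASK_ID token, and the mask ratio are invented for illustration, and a real VLN agent would also condition on visual observations, which are omitted.

```python
import random

import torch
import torch.nn as nn

MASK_ID = 0      # hypothetical id reserved for the masked-action token
NUM_ACTIONS = 8  # hypothetical discrete action vocabulary (includes MASK_ID)

def mask_subpath(path, mask_ratio=0.3):
    """Replace one random contiguous subpath with MASK_ID tokens."""
    span = max(1, int(len(path) * mask_ratio))
    start = random.randint(0, len(path) - span)
    masked = list(path)
    masked[start:start + span] = [MASK_ID] * span
    return masked

class PathReconstructor(nn.Module):
    """Toy sequence model standing in for the navigation agent."""
    def __init__(self, num_actions=NUM_ACTIONS, dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_actions, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_actions)

    def forward(self, tokens):  # tokens: (batch, seq_len) of action ids
        return self.head(self.encoder(self.embed(tokens)))

model = PathReconstructor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One MPM-style step: reconstruct a path recorded during goal-free exploration.
path = [3, 1, 4, 1, 5, 2, 6]            # self-collected action ids
masked = mask_subpath(path)
logits = model(torch.tensor([masked]))  # (1, seq_len, NUM_ACTIONS)
loss = loss_fn(logits.squeeze(0), torch.tensor(path))
loss.backward()
optimizer.step()
```

The key property mirrored here is that supervision comes entirely from data the agent gathered itself: the masked subpath plays the role of the actions to be generated, and the unmasked context conditions the reconstruction.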
Related papers
- TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs).
We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception.
Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
arXiv Detail & Related papers (2024-03-13T05:22:39Z)
- Towards Learning a Generalist Model for Embodied Navigation [24.816490551945435]
We propose the first generalist model for embodied navigation, NaviLLM.
It adapts LLMs to embodied navigation by introducing schema-based instruction.
We conduct extensive experiments to evaluate the performance and generalizability of our model.
arXiv Detail & Related papers (2023-12-04T16:32:51Z)
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for the optimal image masking ratio and masking strategy (a toy sketch of this ratio search appears after this list).
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
- Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for learning.
We use 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs.
Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute over the previous SoTA) to a new best single-run success rate of 80% on the R2R test split through simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z)
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
- Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z)
- Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation [41.334731014665316]
Most existing works in vision-and-language navigation (VLN) focus on either discrete or continuous environments.
We propose a predictor to generate a set of candidate waypoints during navigation.
We show that agents navigating in continuous environments with predicted waypoints perform significantly better than agents using low-level actions.
arXiv Detail & Related papers (2022-03-05T14:56:14Z)
- Multitask Adaptation by Retrospective Exploration with Learned World Models [77.34726150561087]
We propose a meta-learned addressing model called RAMa that provides training samples for the MBRL agent taken from task-agnostic storage.
The model is trained to maximize the agent's expected performance by selecting promising trajectories that solve prior tasks from the storage.
arXiv Detail & Related papers (2021-10-25T20:02:57Z)
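As a toy sketch of the ratio search mentioned in the neuron-segmentation entry above, the snippet below uses an epsilon-greedy bandit as a deliberately simplified stand-in for that paper's RL agent. The candidate ratios, the epsilon value, and the reward signal (imagined here as a downstream validation score after pretraining with the chosen ratio) are all assumptions, not details from the paper.

```python
import random

RATIOS = [0.15, 0.3, 0.5, 0.75]  # hypothetical candidate masking ratios

class RatioBandit:
    """Epsilon-greedy bandit over masking ratios (stand-in for the RL search)."""
    def __init__(self, eps=0.1):
        self.eps = eps
        self.counts = [0] * len(RATIOS)
        self.values = [0.0] * len(RATIOS)  # running mean reward per ratio

    def select(self):
        if random.random() < self.eps:  # occasionally explore a random ratio
            return random.randrange(len(RATIOS))
        return max(range(len(RATIOS)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        """Fold a new reward into the running mean for the chosen ratio."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = RatioBandit()
arm = bandit.select()
ratio = RATIOS[arm]
# ... pretrain with `ratio`, measure a validation reward, then:
bandit.update(arm, reward=0.42)  # placeholder reward value
```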
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.