History Aware Multimodal Transformer for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2110.13309v2
- Date: Thu, 17 Aug 2023 22:42:07 GMT
- Title: History Aware Multimodal Transformer for Vision-and-Language Navigation
- Authors: Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev
- Abstract summary: Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
We introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making.
- Score: 96.80655332881432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language navigation (VLN) aims to build autonomous visual agents
that follow instructions and navigate in real scenes. To remember previously
visited locations and actions taken, most approaches to VLN implement memory
using recurrent states. Instead, we introduce a History Aware Multimodal
Transformer (HAMT) to incorporate a long-horizon history into multimodal
decision making. HAMT efficiently encodes all the past panoramic observations
via a hierarchical vision transformer (ViT), which first encodes individual
images with ViT, then models spatial relations between images in a panoramic
observation, and finally takes into account temporal relations between panoramas
in the history. It then jointly combines text, history and current
observation to predict the next action. We first train HAMT end-to-end using
several proxy tasks including single step action prediction and spatial
relation prediction, and then use reinforcement learning to further improve the
navigation policy. HAMT achieves new state of the art on a broad range of VLN
tasks, including VLN with fine-grained instructions (R2R, RxR), high-level
instructions (R2R-Last, REVERIE), dialogs (CVDN) as well as long-horizon VLN
(R4R, R2R-Back). We demonstrate HAMT to be particularly effective for
navigation tasks with longer trajectories.
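Below is a minimal PyTorch-style sketch of the hierarchical encoding described in the abstract: per-image ViT features, a spatial transformer over the views of one panorama, a temporal transformer over the panoramas in the history, and a joint fusion of text, history and current observation that scores candidate actions. All module names, layer counts and dimensions are illustrative assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn

class HierarchicalHistoryEncoder(nn.Module):
    """Past panoramas: per-view ViT features -> spatial (within panorama) -> temporal (across history)."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.view_proj = nn.Linear(d_model, d_model)  # stands in for a per-image ViT backbone
        spatial_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.spatial_enc = nn.TransformerEncoder(spatial_layer, num_layers=2)
        self.temporal_enc = nn.TransformerEncoder(temporal_layer, num_layers=2)

    def forward(self, view_feats):
        # view_feats: (batch, history_len, n_views, d_model), e.g. 36 discrete views per panorama
        b, t, v, d = view_feats.shape
        x = self.view_proj(view_feats).reshape(b * t, v, d)
        x = self.spatial_enc(x)                # spatial relations between views of one panorama
        pano = x.mean(dim=1).reshape(b, t, d)  # pool each panorama into a single history token
        return self.temporal_enc(pano)         # temporal relations between panoramas in the history

class ActionHead(nn.Module):
    """Jointly combines text, history and current observation, then scores candidate actions."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        fusion_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=4)
        self.score = nn.Linear(d_model, 1)

    def forward(self, text_tokens, history_tokens, obs_tokens):
        # Concatenate the three modalities into one sequence for joint self-attention.
        fused = self.fusion(torch.cat([text_tokens, history_tokens, obs_tokens], dim=1))
        n_cand = obs_tokens.size(1)
        return self.score(fused[:, -n_cand:]).squeeze(-1)  # one logit per candidate view / action

if __name__ == "__main__":
    b, t, v, d = 2, 5, 36, 768
    history = HierarchicalHistoryEncoder(d)(torch.randn(b, t, v, d))
    logits = ActionHead(d)(torch.randn(b, 20, d), history, torch.randn(b, v, d))
    print(logits.shape)  # torch.Size([2, 36]): scores over navigable views at the current step

In the two-stage training outlined above, such an action head would first be supervised with proxy tasks such as single step action prediction and then fine-tuned with reinforcement learning; that training loop is omitted here.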
Related papers
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current state of the art in IVLN techniques.
To fully exploit the interpreted navigation data, we introduce a structured representation, termed Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z)
- ESceme: Vision-and-Language Navigation with Episodic Scene Memory [72.69189330588539]
Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes.
We introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent's memories of past visits when it enters the current scene.
arXiv Detail & Related papers (2023-03-02T07:42:07Z)
- HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation [33.38079488853708]
Previous pre-training methods for Vision-and-Language Navigation (VLN) either lack the ability to predict future actions or ignore contexts.
We propose a novel pre-training paradigm that exploits past observations and supports future action prediction.
Our navigation action prediction is also enhanced by the task of Action Prediction with History.
arXiv Detail & Related papers (2022-03-22T10:17:12Z)
- Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation [79.1669476932147]
Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to a goal position.
Recent Transformer-based VLN methods have made great progress benefiting from the direct connections between visual observations and the language instruction.
We introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation.
arXiv Detail & Related papers (2021-11-10T16:04:49Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a structured scene memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)