Less is More: Generating Grounded Navigation Instructions from Landmarks
- URL: http://arxiv.org/abs/2111.12872v2
- Date: Mon, 29 Nov 2021 14:45:50 GMT
- Title: Less is More: Generating Grounded Navigation Instructions from Landmarks
- Authors: Su Wang, Ceslee Montgomery, Jordi Orbay, Vighnesh Birodkar, Aleksandra
Faust, Izzeddin Gur, Natasha Jaques, Austin Waters, Jason Baldridge, Peter
Anderson
- Abstract summary: We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes.
Our MARKY-MT5 system addresses this by focusing on visual landmarks.
It comprises a first stage landmark detector and a second stage generator -- a multimodal, multilingual, multitask encoder-decoder.
- Score: 71.60176664576551
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the automatic generation of navigation instructions from 360-degree
images captured on indoor routes. Existing generators suffer from poor visual
grounding, causing them to rely on language priors and hallucinate objects. Our
MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a
first stage landmark detector and a second stage generator -- a multimodal,
multilingual, multitask encoder-decoder. To train it, we bootstrap grounded
landmark annotations on top of the Room-across-Room (RxR) dataset. Using text
parsers, weak supervision from RxR's pose traces, and a multilingual image-text
encoder trained on 1.8b images, we identify 1.1m English, Hindi and Telugu
landmark descriptions and ground them to specific regions in panoramas. On
Room-to-Room, human wayfinders obtain success rates (SR) of 71% following
MARKY-MT5's instructions, just shy of their 75% SR following human instructions
-- and well above SRs with other generators. Evaluations on RxR's longer,
diverse paths obtain 61-64% SRs on three languages. Generating such
high-quality navigation instructions in novel environments is a step towards
conversational navigation tools and could facilitate larger-scale training of
instruction-following agents.
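To make the two-stage design above concrete, here is a minimal sketch of the interface between a first-stage landmark detector and a second-stage multilingual instruction generator. It is an illustration only: the class names, fields, and the template-based text are assumptions, not the MARKY-MT5 code, and the real second stage is a multimodal, multilingual, multitask encoder-decoder rather than string formatting.
```python
from dataclasses import dataclass
from typing import List

# Hypothetical data structures; field names are illustrative, not from the paper.

@dataclass
class PanoRegion:
    pano_id: str          # panorama along the route
    heading_deg: float    # direction of the region within the 360-degree image
    description: str      # grounded landmark phrase, e.g. "the blue armchair"
    score: float          # detector confidence

@dataclass
class RouteStep:
    pano_id: str
    action: str           # e.g. "forward", "turn_left", "stop"

class LandmarkDetector:
    """Stage 1 (sketch): pick one salient, visible landmark per route step."""
    def detect(self, steps: List[RouteStep]) -> List[PanoRegion]:
        # A real detector would ground candidate phrases to panorama regions
        # using an image-text encoder; here we return placeholders.
        return [
            PanoRegion(s.pano_id, heading_deg=0.0,
                       description=f"landmark near {s.pano_id}", score=1.0)
            for s in steps
        ]

class InstructionGenerator:
    """Stage 2 (sketch): a multilingual encoder-decoder would condition on
    the route actions and the detected landmark regions."""
    def generate(self, steps: List[RouteStep],
                 landmarks: List[PanoRegion], language: str = "en") -> str:
        parts = [f"{s.action} toward {lm.description}"
                 for s, lm in zip(steps, landmarks)]
        return f"[{language}] " + ", then ".join(parts) + "."

if __name__ == "__main__":
    route = [RouteStep("pano_001", "forward"),
             RouteStep("pano_002", "turn_left"),
             RouteStep("pano_003", "stop")]
    detector, generator = LandmarkDetector(), InstructionGenerator()
    landmarks = detector.detect(route)
    print(generator.generate(route, landmarks, language="en"))
```
The point of the sketch is the contract between the two stages: the generator only refers to regions the detector has grounded, which is what curbs the object hallucination mentioned in the abstract.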
Related papers
- Learning Vision-and-Language Navigation from YouTube Videos [89.1919348607439]
Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions.
There are massive house tour videos on YouTube, providing abundant real navigation experiences and layout information.
We create a large-scale dataset comprising reasonable path-instruction pairs from house tour videos and pre-train the agent on it.
arXiv Detail & Related papers (2023-07-22T05:26:50Z)
- VLN-Trans: Translator for the Vision and Language Navigation Agent [23.84492755669486]
We design a translator module for the navigation agent to convert the original instructions into easy-to-follow sub-instruction representations.
We create a new synthetic sub-instruction dataset and design specific tasks to train the translator and the navigation agent.
We evaluate our approach on the Room-to-Room (R2R), Room-for-Room (R4R), and Room-to-Room Last (R2R-Last) datasets.
arXiv Detail & Related papers (2023-02-18T04:19:51Z)
- A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning [70.14372215250535]
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments.
Given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding.
We take 500+ indoor environments captured in densely-sampled 360-degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory.
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets.
arXiv Detail & Related papers (2022-10-06T17:59:08Z)
- Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation [30.429893959096752]
We develop a novel training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation.
We construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs.
Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset.
arXiv Detail & Related papers (2022-01-26T07:43:47Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
- Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding [75.03682706791389]
We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset.
RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and instructions) than other VLN datasets.
It emphasizes the role of language in VLN by addressing known biases in paths and eliciting more references to visible entities.
arXiv Detail & Related papers (2020-10-15T18:01:15Z)
- Sub-Instruction Aware Vision-and-Language Navigation [46.99329933894108]
Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.
We focus on the granularity of the visual and language sequences as well as the traceability of agents through the completion of an instruction.
We propose effective sub-instruction attention and shifting modules that select and attend to a single sub-instruction at each time-step.
arXiv Detail & Related papers (2020-04-06T14:44:53Z)
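Several results above are reported as wayfinder or agent success rates (SR): 71% vs. 75% on Room-to-Room for MARKY-MT5 vs. human instructions, 61-64% on RxR, and 68%/66% for the 3D-semantics model. As a reference point, the sketch below computes SR from a batch of navigation episodes; the 3-meter stopping threshold is the conventional R2R criterion and is an assumption here, not something stated in the abstracts above.
```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) position in meters

def success_rate(final_positions: List[Point],
                 goal_positions: List[Point],
                 threshold_m: float = 3.0) -> float:
    """Fraction of episodes whose final stop lies within `threshold_m`
    of the goal (3 m is the conventional R2R cutoff, assumed here)."""
    assert len(final_positions) == len(goal_positions)
    successes = sum(
        math.dist(f, g) <= threshold_m
        for f, g in zip(final_positions, goal_positions)
    )
    return successes / len(final_positions)

# Example: 3 of 4 episodes end near the goal -> SR = 75%
finals = [(0.5, 0.0, 0.0), (2.9, 0.1, 0.0), (5.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
goals  = [(0.0, 0.0, 0.0)] * 4
print(f"SR = {success_rate(finals, goals):.0%}")  # SR = 75%
```
SR is a per-episode binary averaged over episodes; it does not reflect path length or efficiency, which is why VLN papers often report it alongside other metrics.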