CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double
Back-Translation for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2103.00852v2
- Date: Mon, 21 Aug 2023 12:08:58 GMT
- Title: CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double
Back-Translation for Vision-and-Language Navigation
- Authors: Aly Magassouba, Komei Sugiura, and Hisashi Kawai
- Abstract summary: Navigation guided by natural language instructions is particularly suitable for Domestic Service Robots that interact naturally with users.
This task involves the prediction of a sequence of actions that leads to a specified destination given a natural language navigation instruction.
We propose the CrossMap Transformer network, which encodes the linguistic and visual features to sequentially generate a path.
- Score: 11.318892271652695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Navigation guided by natural language instructions is particularly
suitable for Domestic Service Robots that interact naturally with users. This
task involves predicting a sequence of actions that leads to a specified
destination given a natural language navigation instruction. The task thus
requires the understanding of instructions such as ``Walk out of the bathroom
and wait on the stairs that are on the right''. Vision-and-Language Navigation
remains challenging, notably because it requires exploring the environment and
accurately following the path specified by the instructions, which in turn
requires modeling the relationship between language and vision. To address
this, we propose the CrossMap Transformer network, which encodes linguistic and
visual features to sequentially generate a path. The CrossMap Transformer is
tied to a Transformer-based speaker that generates navigation instructions. The
two networks share common latent features for mutual enhancement through a
double back-translation model: generated paths are translated into
instructions, while generated instructions are translated into paths. The
experimental results show the benefits of our approach in terms of instruction
understanding and instruction generation.
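The double back-translation scheme can be illustrated with a minimal, teacher-forced sketch: a follower maps an instruction to a path, a speaker maps a path back to an instruction, both share a crossmodal encoder, and each network is additionally supervised on the other's generated output. All module names, sizes, and the PyTorch layout below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of one double back-translation training step.
# A follower (instruction -> path) and a speaker (path -> instruction) share a
# crossmodal encoder; every name, size, and loss here is a hypothetical
# illustration, not the CrossMap Transformer itself.
import torch
import torch.nn as nn

VOCAB, N_NODES, D = 1000, 50, 256  # assumed vocabulary, path-node, and hidden sizes


class SharedCrossmodalEncoder(nn.Module):
    """Maps token or node embeddings into a common latent space."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        return self.encoder(x)


class Follower(nn.Module):
    """Instruction -> path: predicts one path node per step (teacher-forced)."""
    def __init__(self, shared):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.shared = shared
        self.head = nn.Linear(D, N_NODES)

    def forward(self, instr_tokens):
        return self.head(self.shared(self.embed(instr_tokens)))  # (B, T, N_NODES)


class Speaker(nn.Module):
    """Path -> instruction: predicts one word per step (teacher-forced)."""
    def __init__(self, shared):
        super().__init__()
        self.embed = nn.Embedding(N_NODES, D)
        self.shared = shared
        self.head = nn.Linear(D, VOCAB)

    def forward(self, path_nodes):
        return self.head(self.shared(self.embed(path_nodes)))  # (B, T, VOCAB)


shared = SharedCrossmodalEncoder()
model = nn.ModuleDict({"follower": Follower(shared), "speaker": Speaker(shared)})
follower, speaker = model["follower"], model["speaker"]
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # shared weights counted once
ce = nn.CrossEntropyLoss()

# Toy ground-truth pair: an instruction and the path it describes
# (same length for simplicity).
instr = torch.randint(0, VOCAB, (4, 12))    # (batch, words)
path = torch.randint(0, N_NODES, (4, 12))   # (batch, steps)

# Direct supervision: follow the real instruction, describe the real path.
loss = (ce(follower(instr).flatten(0, 1), path.flatten())
        + ce(speaker(path).flatten(0, 1), instr.flatten()))

# Double back-translation: the generated path should translate back into the
# original instruction, and the generated instruction back into the original path.
gen_path = follower(instr).argmax(-1).detach()
gen_instr = speaker(path).argmax(-1).detach()
loss = loss + (ce(speaker(gen_path).flatten(0, 1), instr.flatten())
               + ce(follower(gen_instr).flatten(0, 1), path.flatten()))

opt.zero_grad()
loss.backward()
opt.step()
```

In practice the paths and instructions would be decoded autoregressively and conditioned on visual features; the toy setup above only shows how the two translation directions share latent features and supervise each other.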
Related papers
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- Lana: A Language-Capable Navigator for Instruction Following and Generation [70.76686546473994]
LANA is a language-capable navigation agent which is able to execute human-written navigation commands and provide route descriptions to humans.
We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performance on both instruction following and route description.
In addition, endowed with language generation capability, LANA can explain its behaviors to humans and assist human wayfinding.
arXiv Detail & Related papers (2023-03-15T07:21:28Z)
- VLN-Trans: Translator for the Vision and Language Navigation Agent [23.84492755669486]
We design a translator module for the navigation agent to convert the original instructions into easy-to-follow sub-instruction representations.
We create a new synthetic sub-instruction dataset and design specific tasks to train the translator and the navigation agent.
We evaluate our approach on the Room2Room (R2R), Room4Room (R4R), and Room2Room Last (R2R-Last) datasets.
arXiv Detail & Related papers (2023-02-18T04:19:51Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
arXiv Detail & Related papers (2022-03-30T18:15:26Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions; a minimal sketch of this kind of encoding appears after this list.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
- Generating Landmark Navigation Instructions from Maps as a Graph-to-Text Problem [15.99072005190786]
We present a neural model that takes OpenStreetMap representations as input and learns to generate navigation instructions.
Our work is based on a novel dataset of 7,672 crowd-sourced instances that have been verified by human navigation in Street View.
arXiv Detail & Related papers (2020-12-30T21:22:04Z)
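As a companion to the Episodic Transformer entry above, the sketch below shows one way a single transformer can jointly encode instruction tokens, the history of visual observations, and past actions, and then predict the next action. The vocabulary sizes, feature dimensions, and module layout are assumptions made for illustration, not the E.T. authors' code.

```python
# Illustrative episodic, multimodal encoder: instruction tokens, observation
# features, and past actions are embedded, tagged by modality, and encoded
# jointly; all sizes and names are assumptions, not the published model.
import torch
import torch.nn as nn

VOCAB, N_ACTIONS, VIS_DIM, D = 1000, 12, 512, 256


class EpisodicEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_embed = nn.Embedding(VOCAB, D)
        self.action_embed = nn.Embedding(N_ACTIONS, D)
        self.vis_proj = nn.Linear(VIS_DIM, D)       # project frame features to D
        self.type_embed = nn.Embedding(3, D)        # 0 = word, 1 = frame, 2 = action
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(D, N_ACTIONS)  # next-action logits

    def forward(self, words, frames, actions):
        # words: (B, Lw) token ids, frames: (B, T, VIS_DIM), actions: (B, T) ids
        w = self.word_embed(words) + self.type_embed.weight[0]
        f = self.vis_proj(frames) + self.type_embed.weight[1]
        a = self.action_embed(actions) + self.type_embed.weight[2]
        h = self.encoder(torch.cat([w, f, a], dim=1))  # joint crossmodal attention
        return self.action_head(h[:, -1])              # predict the next action


model = EpisodicEncoder()
words = torch.randint(0, VOCAB, (2, 16))       # a 16-word instruction
frames = torch.randn(2, 5, VIS_DIM)            # 5 past visual observations
actions = torch.randint(0, N_ACTIONS, (2, 5))  # 5 past actions
print(model(words, frames, actions).shape)     # torch.Size([2, 12])
```

The modality-type embeddings mark whether each position is a word, a frame, or an action, so the encoder can attend across the whole episode in a single pass.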