Counterfactual Cycle-Consistent Learning for Instruction Following and
Generation in Vision-Language Navigation
- URL: http://arxiv.org/abs/2203.16586v1
- Date: Wed, 30 Mar 2022 18:15:26 GMT
- Title: Counterfactual Cycle-Consistent Learning for Instruction Following and
Generation in Vision-Language Navigation
- Authors: Hanqing Wang, Wei Liang, Jianbing Shen, Luc Van Gool, Wenguan Wang
- Abstract summary: We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
- Score: 172.15808300686584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the rise of vision-language navigation (VLN), great progress has been
made in instruction following -- building a follower to navigate environments
under the guidance of instructions. However, far less attention has been paid
to the inverse task: instruction generation -- learning a speaker to generate
grounded descriptions for navigation routes. Existing VLN methods train a
speaker independently and often treat it as a data augmentation tool to
strengthen the follower while ignoring rich cross-task relations. Here we
describe an approach that learns the two tasks simultaneously and exploits
their intrinsic correlations to boost the training of each: the follower judges
whether the speaker-created instruction explains the original navigation route
correctly, and vice versa. Without the need for aligned instruction-path pairs,
such a cycle-consistent learning scheme is complementary to task-specific
training targets defined on labeled data, and can also be applied over
unlabeled paths (sampled without paired instructions). Another agent,
called creator, is added to generate counterfactual environments. It greatly
changes current scenes yet leaves novel items -- which are vital for the
execution of original instructions -- unchanged. Thus more informative training
scenes are synthesized and the three agents compose a powerful VLN learning
system. Extensive experiments on a standard benchmark show that our approach
improves the performance of various follower models and produces accurate
navigation instructions.
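The cycle-consistency idea described above can be illustrated with a minimal toy sketch. All names here (speaker, follower, cycle_consistency_score) are hypothetical stand-ins, not the paper's actual models: a speaker maps a route to an instruction, a follower maps the instruction back to a route, and the agreement between the original and reconstructed route provides a training signal that needs no aligned instruction-path pairs.

```python
def speaker(path):
    """Toy speaker: describes a navigation path as a sequence of move tokens."""
    return ["go-" + step for step in path]


def follower(instruction):
    """Toy follower: executes move tokens back into a navigation path."""
    return [token.removeprefix("go-") for token in instruction]


def cycle_consistency_score(path):
    """Fraction of steps recovered after a path -> instruction -> path cycle.

    In the paper this kind of agreement signal supplements the supervised
    objectives; here we just measure it on a sampled (unlabeled) path.
    """
    reconstructed = follower(speaker(path))
    matches = sum(a == b for a, b in zip(path, reconstructed))
    return matches / max(len(path), 1)


# An unlabeled path sampled without a paired instruction can still be scored.
unlabeled_path = ["forward", "left", "forward", "right"]
score = cycle_consistency_score(unlabeled_path)
```

With the toy models being exact inverses, the score is 1.0; real speaker and follower networks would disagree, and that disagreement is what the cycle-consistent objective penalizes during joint training.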
Related papers
- From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes.
The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models.
Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z)
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- Lana: A Language-Capable Navigator for Instruction Following and Generation [70.76686546473994]
LANA is a language-capable navigation agent which is able to execute human-written navigation commands and provide route descriptions to humans.
We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performance on both instruction following and route description.
In addition, endowed with language generation capability, LANA can explain to humans its behaviors and assist human's wayfinding.
arXiv Detail & Related papers (2023-03-15T07:21:28Z)
- Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
arXiv Detail & Related papers (2021-12-08T06:32:52Z)
- Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation [145.84123197129298]
Language instruction plays an essential role in the natural language grounded navigation tasks.
We seek to train a more robust navigator capable of dynamically extracting crucial factors from long instructions.
Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator to move to the wrong target.
arXiv Detail & Related papers (2021-07-23T14:11:31Z)
- Sub-Instruction Aware Vision-and-Language Navigation [46.99329933894108]
Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.
We focus on the granularity of the visual and language sequences as well as the traceability of agents through the completion of an instruction.
We propose effective sub-instruction attention and shifting modules that select and attend to a single sub-instruction at each time-step.
arXiv Detail & Related papers (2020-04-06T14:44:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.