Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for
Navigation Instruction Generation
- URL: http://arxiv.org/abs/2307.13368v1
- Date: Tue, 25 Jul 2023 09:39:59 GMT
- Title: Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for
Navigation Instruction Generation
- Authors: Haitian Zeng, Xiaohan Wang, Wenguan Wang, Yi Yang
- Abstract summary: We introduce a novel speaker model Kefa for navigation instruction generation.
The proposed KEFA speaker achieves state-of-the-art instruction generation performance for both indoor and outdoor scenes.
- Score: 70.76686546473994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel speaker model \textsc{Kefa} for navigation instruction
generation. The existing speaker models in Vision-and-Language Navigation
suffer from the large domain gap of vision features between different
environments and insufficient temporal grounding capability. To address the
challenges, we propose a Knowledge Refinement Module to enhance the feature
representation with external knowledge facts, and an Adaptive Temporal
Alignment method to enforce fine-grained alignment between the generated
instructions and the observation sequences. Moreover, we propose a new metric
SPICE-D for navigation instruction evaluation, which is aware of the
correctness of direction phrases. The experimental results on R2R and UrbanWalk
datasets show that the proposed KEFA speaker achieves state-of-the-art
instruction generation performance for both indoor and outdoor scenes.
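The abstract describes SPICE-D only at a high level, as a variant of SPICE that also accounts for the correctness of direction phrases. Purely as a hedged illustration of that idea, and not the paper's actual definition, the minimal Python sketch below combines a standard SPICE score with an in-order direction-word matching term; the direction vocabulary, the greedy matching rule, the mixing weight alpha, and the spice_score callable are all assumptions introduced here.

```python
# Illustrative sketch of a direction-aware instruction metric in the spirit of
# SPICE-D. The direction vocabulary, penalty form, and spice_score() interface
# are assumptions for illustration only, not the paper's formulation.
import re

# Assumed vocabulary of direction phrases.
DIRECTION_WORDS = {"left", "right", "forward", "straight", "back", "around", "up", "down"}

def direction_tokens(text):
    """Extract direction words, in order, from an instruction."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t in DIRECTION_WORDS]

def direction_accuracy(generated, reference):
    """Fraction of reference direction words matched, in order, by the generation."""
    gen, ref = direction_tokens(generated), direction_tokens(reference)
    if not ref:
        return 1.0
    matched, i = 0, 0
    for d in ref:
        while i < len(gen) and gen[i] != d:
            i += 1
        if i < len(gen):
            matched, i = matched + 1, i + 1
    return matched / len(ref)

def spice_d(generated, reference, spice_score, alpha=0.5):
    """Mix a standard SPICE score with the direction-accuracy term.

    spice_score: callable returning SPICE for (generated, reference);
    alpha: assumed mixing weight, not taken from the paper.
    """
    return (1 - alpha) * spice_score(generated, reference) + alpha * direction_accuracy(generated, reference)
```

For example, with the reference "turn left then go forward" and the generation "go forward and turn left", the in-order matching only credits one of the two direction words, so the direction term drops to 0.5 even though both words appear.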
Related papers
- Prompt-based Context- and Domain-aware Pretraining for Vision and
Language Navigation [19.793659852435486]
We propose a novel Prompt-bAsed coNtext- and Domain-Aware (PANDA) pretraining framework to address these problems.
In the domain-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an in-domain dataset.
In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics in the instruction.
arXiv Detail & Related papers (2023-09-07T11:58:34Z)
- FOAM: A Follower-aware Speaker Model For Vision-and-Language Navigation [45.99831101677059]
We present FOAM, a Follower-aware speaker Model that is constantly updated given the follower feedback.
We optimize the speaker using a bi-level optimization framework and obtain its training signals by evaluating the follower on labeled data (a rough sketch of this training-loop shape appears after the list below).
arXiv Detail & Related papers (2022-06-09T06:11:07Z)
- Counterfactual Cycle-Consistent Learning for Instruction Following and
Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
arXiv Detail & Related papers (2022-03-30T18:15:26Z)
- Contrastive Instruction-Trajectory Learning for Vision-Language
Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
arXiv Detail & Related papers (2021-12-08T06:32:52Z)
- Towards Navigation by Reasoning over Spatial Configurations [20.324906029170457]
We show the importance of spatial semantics in grounding navigation instructions into visual perceptions.
We propose a neural agent that uses the elements of spatial configurations and investigate their influence on the navigation agent's reasoning ability.
arXiv Detail & Related papers (2021-05-14T14:04:23Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT
for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-and-room informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
- Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies report a slow-down in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a structured memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
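As referenced in the FOAM entry above, that paper's summary mentions a bi-level scheme in which the speaker's training signal comes from evaluating a follower on labeled data. The outline below is only a rough illustration of that training-loop shape; the model interfaces, the inner/outer step structure, and the reward definition are assumptions made here, not FOAM's actual implementation.

```python
# Illustrative outline of a follower-aware speaker update in the spirit of a
# bi-level speaker-follower scheme. All interfaces below are assumed for the
# sketch and are not taken from the paper.

def follower_aware_update(speaker, follower, labeled_batch,
                          inner_steps=1, speaker_lr=1e-4):
    """One outer update of the speaker, guided by follower performance.

    speaker:  object with generate(trajectory) -> instruction and
              update(weighted_batch, lr) methods (assumed interface).
    follower: object with train_step(instruction, trajectory) and
              success(instruction, trajectory) -> float in [0, 1] (assumed).
    labeled_batch: list of (trajectory, gold_instruction) pairs.
    """
    # Inner level: adapt the follower on speaker-generated instructions.
    generated = [(traj, speaker.generate(traj)) for traj, _ in labeled_batch]
    for _ in range(inner_steps):
        for traj, instr in generated:
            follower.train_step(instr, traj)

    # Outer level: evaluate the adapted follower on the labeled data and use
    # those scores as the speaker's training signal.
    rewards = [follower.success(gold, traj) for traj, gold in labeled_batch]
    speaker.update(list(zip(labeled_batch, rewards)), lr=speaker_lr)
    return sum(rewards) / max(len(rewards), 1)
```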