VLN-Trans: Translator for the Vision and Language Navigation Agent
- URL: http://arxiv.org/abs/2302.09230v1
- Date: Sat, 18 Feb 2023 04:19:51 GMT
- Title: VLN-Trans: Translator for the Vision and Language Navigation Agent
- Authors: Yue Zhang, Parisa Kordjamshidi
- Abstract summary: We design a translator module for the navigation agent to convert the original instructions into easy-to-follow sub-instruction representations.
We create a new synthetic sub-instruction dataset and design specific tasks to train the translator and the navigation agent.
We evaluate our approach on the Room2Room (R2R), Room4Room (R4R), and Room2Room Last (R2R-Last) datasets.
- Score: 23.84492755669486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language understanding is essential for the navigation agent to follow
instructions. We observe two kinds of issues in the instructions that can make
the navigation task challenging: 1. The mentioned landmarks are not
recognizable by the navigation agent due to the different vision abilities of
the instructor and the modeled agent. 2. The mentioned landmarks are applicable
to multiple targets, thus not distinctive for selecting the target among the
candidate viewpoints. To deal with these issues, we design a translator module
for the navigation agent to convert the original instructions into
easy-to-follow sub-instruction representations at each step. The translator
needs to focus on the recognizable and distinctive landmarks based on the
agent's visual abilities and the observed visual environment. To achieve this
goal, we create a new synthetic sub-instruction dataset and design specific
tasks to train the translator and the navigation agent. We evaluate our
approach on Room2Room (R2R), Room4Room (R4R), and Room2Room Last (R2R-Last)
datasets and achieve state-of-the-art results on multiple benchmarks.
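The abstract's two selection criteria, recognizable and distinctive landmarks, can be made concrete with a minimal, hypothetical Python sketch. This is not the authors' implementation (their translator learns sub-instruction representations rather than applying rules); all names, data structures, and the toy detections below are assumptions for illustration only.

```python
# Hypothetical illustration (not the authors' code): filter landmark mentions from
# an instruction by whether the agent can recognize them in the candidate views
# and whether they are distinctive, i.e. visible from exactly one candidate.

from collections import Counter
from typing import Iterable, List, Set


def select_landmarks(instruction_landmarks: Iterable[str],
                     candidate_views: List[Set[str]]) -> List[str]:
    """Keep landmarks that are recognizable (detected in at least one candidate
    view) and distinctive (detected in exactly one candidate view)."""
    # Count how many candidate viewpoints contain each detected object.
    counts = Counter(obj for view in candidate_views for obj in view)
    return [lm for lm in instruction_landmarks if counts[lm] == 1]


if __name__ == "__main__":
    # Toy detections: "sofa" is visible from two candidates (not distinctive),
    # "piano" is not detected at all (not recognizable to the agent).
    landmarks = ["sofa", "fireplace", "piano"]
    views = [{"sofa", "fireplace"}, {"sofa", "table"}, {"lamp"}]
    print(select_landmarks(landmarks, views))  # -> ['fireplace']
```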
Related papers
- NavHint: Vision and Language Navigation Agent with a Hint Generator [31.322331792911598]
We provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions.
The hint generator assists the navigation agent in developing a global understanding of the visual environment.
We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art on several metrics.
arXiv Detail & Related papers (2024-02-04T16:23:16Z)
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- Towards Versatile Embodied Navigation [120.73460380993305]
VIENNA is a versatile embodied navigation agent that simultaneously learns to perform four navigation tasks with one model.
We empirically demonstrate that, compared with learning each visual navigation task individually, our agent achieves comparable or even better performance with reduced complexity.
arXiv Detail & Related papers (2022-10-30T11:53:49Z)
- LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation [23.84492755669486]
In this paper, we design a neural agent with explicit Orientation and Vision modules.
Those modules learn to ground spatial information and landmark mentions in the instructions to the visual environment more effectively.
We evaluate our approach on both the Room2Room (R2R) and Room4Room (R4R) datasets and achieve state-of-the-art results on both benchmarks.
arXiv Detail & Related papers (2022-09-26T14:26:50Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
- Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information [83.62098382773266]
Vision language navigation is the task that requires an agent to navigate through a 3D environment based on natural language instructions.
We propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes.
Our agent achieves new state-of-the-art results on the Room-Across-Room dataset, which contains instructions in three languages.
arXiv Detail & Related papers (2021-04-19T19:18:41Z)
- Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies have observed a slowdown in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)
- Sub-Instruction Aware Vision-and-Language Navigation [46.99329933894108]
Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.
We focus on the granularity of the visual and language sequences as well as the traceability of agents through the completion of an instruction.
We propose effective sub-instruction attention and shifting modules that select and attend to a single sub-instruction at each time-step.
arXiv Detail & Related papers (2020-04-06T14:44:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.