Improving Cross-Modal Alignment in Vision Language Navigation via
Syntactic Information
- URL: http://arxiv.org/abs/2104.09580v1
- Date: Mon, 19 Apr 2021 19:18:41 GMT
- Title: Improving Cross-Modal Alignment in Vision Language Navigation via
Syntactic Information
- Authors: Jialu Li, Hao Tan, Mohit Bansal
- Abstract summary: Vision language navigation is a task that requires an agent to navigate through a 3D environment based on natural language instructions.
We propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes.
Our agent achieves new state-of-the-art results on the Room-Across-Room dataset, which contains instructions in 3 languages.
- Score: 83.62098382773266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision language navigation is a task that requires an agent to navigate
through a 3D environment based on natural language instructions. One key
challenge in this task is to ground instructions with the current visual
information that the agent perceives. Most of the existing work employs soft
attention over individual words to locate the instruction required for the next
action. However, different words have different functions in a sentence (e.g.,
modifiers convey attributes, verbs convey actions). Syntactic information such as
dependencies and phrase structures can help the agent locate the important parts
of the instruction. Hence, in this paper, we propose a navigation agent that
utilizes syntax information derived from a dependency tree to enhance alignment
between the instruction and the current visual scenes. Empirically, our agent
outperforms the baseline model that does not use syntax information on the
Room-to-Room dataset, especially in unseen environments. Moreover, our agent
achieves new state-of-the-art results on the Room-Across-Room dataset, which
contains instructions in three languages (English, Hindi, and Telugu). We also show that our
agent is better at aligning instructions with the current visual information
via qualitative visualizations. Code and models:
https://github.com/jialuli-luka/SyntaxVLN
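The core idea — using dependency relations to bias the agent's attention over instruction words — can be illustrated with a minimal sketch. This is not the authors' implementation: the dependency parse below is hardcoded for one toy instruction, and the `ROLE_BONUS` weights are illustrative assumptions. In the paper's setting the parse would come from a dependency parser and the weighting would be learned.

```python
import math

# Toy dependency parse of the instruction
# "Walk past the table and stop at the door."
# Each token: (word, POS tag, dependency relation to its head).
# Hardcoded here for illustration; a real agent would obtain this
# from a dependency parser.
parsed = [
    ("Walk", "VERB", "ROOT"),
    ("past", "ADP", "prep"),
    ("the", "DET", "det"),
    ("table", "NOUN", "pobj"),
    ("and", "CCONJ", "cc"),
    ("stop", "VERB", "conj"),
    ("at", "ADP", "prep"),
    ("the", "DET", "det"),
    ("door", "NOUN", "pobj"),
]

# Illustrative (assumed) syntactic priors: emphasize action verbs
# and the landmark nouns they govern; function words get no bonus.
ROLE_BONUS = {"ROOT": 2.0, "conj": 2.0, "pobj": 1.0}

def syntax_biased_attention(scores, parse):
    """Add a syntax-derived bias to raw word-attention scores,
    then normalize with a softmax."""
    biased = [s + ROLE_BONUS.get(dep, 0.0)
              for s, (_, _, dep) in zip(scores, parse)]
    exps = [math.exp(b) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# With uniform raw scores, the syntactic bias alone decides the
# attention distribution: verbs and landmark nouns dominate.
weights = syntax_biased_attention([0.0] * len(parsed), parsed)
top = max(range(len(parsed)), key=weights.__getitem__)
print(parsed[top][0], round(weights[top], 3))
```

The sketch shows only the biasing step; in the actual model the raw scores would come from cross-modal attention between instruction tokens and the current visual features, so the syntax acts as a prior rather than the sole signal.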
Related papers
- NavHint: Vision and Language Navigation Agent with a Hint Generator [31.322331792911598]
We provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions.
The hint generator assists the navigation agent in developing a global understanding of the visual environment.
We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art results on several metrics.
arXiv Detail & Related papers (2024-02-04T16:23:16Z)
- VLN-Trans: Translator for the Vision and Language Navigation Agent [23.84492755669486]
We design a translator module for the navigation agent to convert the original instructions into easy-to-follow sub-instruction representations.
We create a new synthetic sub-instruction dataset and design specific tasks to train the translator and the navigation agent.
We evaluate our approach on the Room2Room (R2R), Room4Room (R4R), and Room2Room-Last (R2R-Last) datasets.
arXiv Detail & Related papers (2023-02-18T04:19:51Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
- ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments [85.81157224163876]
We combine Vision-and-Language Navigation, assembling of collected objects, and object referring expression comprehension, to create a novel joint navigation-and-assembly task, named ArraMon.
During this task, the agent is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment.
We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.
arXiv Detail & Related papers (2020-11-15T23:30:36Z)
- Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of these relationships we are able to improve over the state of the art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)
- Sub-Instruction Aware Vision-and-Language Navigation [46.99329933894108]
Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.
We focus on the granularity of the visual and language sequences as well as the traceability of agents through the completion of an instruction.
We propose effective sub-instruction attention and shifting modules that select and attend to a single sub-instruction at each time-step.
arXiv Detail & Related papers (2020-04-06T14:44:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.