CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations
- URL: http://arxiv.org/abs/2207.02185v1
- Date: Tue, 5 Jul 2022 17:38:59 GMT
- Title: CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations
- Authors: Jialu Li, Hao Tan, Mohit Bansal
- Abstract summary: Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation task.
- Score: 98.30038910061894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) tasks require an agent to navigate
through the environment based on language instructions. In this paper, we aim
to solve two key challenges in this task: utilizing multilingual instructions
for improved instruction-path grounding and navigating through new environments
that are unseen during training. To address these challenges, we propose CLEAR:
Cross-Lingual and Environment-Agnostic Representations. First, our agent learns
a shared and visually-aligned cross-lingual language representation for the
three languages (English, Hindi and Telugu) in the Room-Across-Room dataset.
Our language representation learning is guided by text pairs that are aligned
by visual information. Second, our agent learns an environment-agnostic visual
representation by maximizing the similarity between semantically-aligned image
pairs (with constraints on object-matching) from different environments. Our
environment agnostic visual representation can mitigate the environment bias
induced by low-level visual information. Empirically, on the Room-Across-Room
dataset, we show that our multilingual agent gets large improvements in all
metrics over the strong baseline model when generalizing to unseen environments
with the cross-lingual language representation and the environment-agnostic
visual representation. Furthermore, we show that our learned language and
visual representations can be successfully transferred to the Room-to-Room and
Cooperative Vision-and-Dialogue Navigation task, and present detailed
qualitative and quantitative generalization and grounding analysis. Our code is
available at https://github.com/jialuli-luka/CLEAR
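The abstract describes two alignment objectives: pulling together representations of instructions in different languages that describe the same path (where the pairing is given by shared visual trajectories), and pulling together representations of semantically matched views from different environments. The PyTorch sketch below shows one way such similarity-maximization objectives can be written as contrastive losses. It is a minimal illustration only, not the authors' implementation (see the linked repository for the actual CLEAR code); the encoder architectures, batch construction, and temperature value are assumptions made for illustration.

# Minimal, illustrative sketch of the two alignment objectives described in the
# abstract (NOT the authors' implementation; see the linked repository for the
# actual CLEAR code). Encoders, batch construction, and the temperature value
# are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def alignment_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss: pulls paired rows of `a` and `b` together
    and pushes apart non-paired rows within the same batch."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


class TextEncoder(nn.Module):
    """Stand-in for a shared multilingual instruction encoder."""
    def __init__(self, vocab_size: int = 30000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.rnn(self.embed(token_ids))
        return h[-1]                                   # (B, dim) instruction embedding


class VisualEncoder(nn.Module):
    """Stand-in for a view encoder over pre-extracted image features."""
    def __init__(self, feat_dim: int = 2048, dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)                        # (B, dim) view embedding


text_enc, vis_enc = TextEncoder(), VisualEncoder()

# (a) Cross-lingual objective: instructions in two languages that describe the
# same path (aligned through the same visual trajectory) should map to nearby
# representations.
en_ids = torch.randint(0, 30000, (8, 20))              # English instruction tokens (dummy)
hi_ids = torch.randint(0, 30000, (8, 20))              # Hindi instructions for the same paths (dummy)
lang_loss = alignment_loss(text_enc(en_ids), text_enc(hi_ids))

# (b) Environment-agnostic objective: semantically aligned views from different
# environments (e.g., matched on the objects they contain) should map to nearby
# representations, discouraging reliance on environment-specific low-level cues.
views_env_a = torch.randn(8, 2048)                     # view features from environment A (dummy)
views_env_b = torch.randn(8, 2048)                     # matched view features from environment B (dummy)
vis_loss = alignment_loss(vis_enc(views_env_a), vis_enc(views_env_b))

total_loss = lang_loss + vis_loss
total_loss.backward()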
Related papers
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning [56.07190845063208]
We ask: can embodied reinforcement learning (RL) agents indirectly learn language from non-language tasks?
We design an office navigation environment, where the agent's goal is to find a particular office, and office locations differ across buildings (i.e., tasks).
We find that RL agents are indeed able to learn language indirectly: agents trained with current meta-RL algorithms successfully generalize to reading floor plans with held-out layouts and language phrases.
arXiv Detail & Related papers (2023-06-14T09:48:48Z)
- Accessible Instruction-Following Agent [0.0]
We introduce UVLN, a novel machine-translation-based instruction-augmentation framework for cross-lingual vision-language navigation.
We extend the standard VLN training objectives to a multilingual setting via a cross-lingual language encoder.
Experiments on the Room-Across-Room dataset demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-05-08T23:57:26Z)
- Graph based Environment Representation for Vision-and-Language Navigation in Continuous Environments [20.114506226598508]
Vision-and-Language Navigation in Continuous Environments (VLN-CE) is a navigation task that requires an agent to follow a language instruction in a realistic environment.
We propose a new graph-based environment representation to address the challenges of this task.
arXiv Detail & Related papers (2023-01-11T08:04:18Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information [83.62098382773266]
Vision-and-language navigation is a task that requires an agent to navigate through a 3D environment based on natural language instructions.
We propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes.
Our agent achieves a new state of the art on the Room-Across-Room dataset, which contains instructions in three languages.
arXiv Detail & Related papers (2021-04-19T19:18:41Z)
- VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement Learning [14.553086325168803]
We present VisualHints, a novel environment for multimodal reinforcement learning (RL) involving text-based interactions along with visual hints obtained from the environment.
We introduce an extension of the TextWorld cooking environment with the addition of visual clues interspersed throughout the environment.
The goal is to force an RL agent to use both text and visual features to predict natural language action commands for solving the final task of cooking a meal.
arXiv Detail & Related papers (2020-10-26T18:51:02Z)
- Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of these relationships we are able to improve over the state of the art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)