Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule
- URL: http://arxiv.org/abs/2009.07783v3
- Date: Thu, 8 Oct 2020 17:16:49 GMT
- Title: Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule
- Authors: Shuhei Kurita and Kyunghyun Cho
- Abstract summary: Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach the goal node.
In this paper, we design and investigate a generative language-grounded policy which uses a language model to compute the distribution over all possible instructions.
In experiments, we show that the proposed generative approach outperforms the discriminative approach in the Room-2-Room (R2R) and Room-4-Room (R4R) datasets, especially in the unseen environments.
- Score: 80.0853069632445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach the goal node. While most previous studies have built and investigated a discriminative approach, we notice that there are in fact two possible approaches to building such a VLN agent: discriminative and generative. In this paper, we design and investigate a generative language-grounded policy which uses a language model to compute the distribution over all possible instructions, i.e., all possible sequences of vocabulary tokens, given the action and the transition history. In experiments, we show that the proposed generative approach outperforms the discriminative approach on the Room-2-Room (R2R) and Room-4-Room (R4R) datasets, especially in unseen environments. We further show that the combination of the generative and discriminative policies achieves close to state-of-the-art results on the R2R dataset, demonstrating that the generative and discriminative policies capture different aspects of VLN.
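The abstract's central idea, inverting an instruction-generation model into an action policy with Bayes' rule, can be sketched compactly. The following Python snippet is a minimal illustration under assumed interfaces, not the authors' implementation: `lm_log_prob` stands in for an action-conditioned language model returning log p(instruction | action, history), and `log_prior` stands in for an action prior p(action | history), which is often simply taken to be uniform over candidate actions.

```python
# Minimal sketch (not the paper's code): turning a generative instruction
# model into an action policy via Bayes' rule,
#   p(a | X, h) ∝ p(X | a, h) * p(a | h),
# where X is the instruction, a a candidate action, and h the history.
import math
from typing import Any, Callable, Sequence


def generative_policy_scores(
    instruction: Sequence[int],
    candidate_actions: Sequence[Any],
    history: Any,
    lm_log_prob: Callable[[Sequence[int], Any, Any], float],  # log p(X | a, h), assumed interface
    log_prior: Callable[[Any, Any], float],                   # log p(a | h), assumed interface
) -> list[float]:
    """Return normalized log p(a | instruction, history) for each candidate action."""
    # Score the full instruction under the language model conditioned on each action,
    # then add the action prior.
    unnormalized = [
        lm_log_prob(instruction, a, history) + log_prior(a, history)
        for a in candidate_actions
    ]
    # Normalize with log-sum-exp so the scores form a proper distribution over actions.
    m = max(unnormalized)
    log_z = m + math.log(sum(math.exp(s - m) for s in unnormalized))
    return [s - log_z for s in unnormalized]
```

A discriminative policy would instead model p(a | instruction, history) directly; the combination mentioned in the abstract could be realized, for example, by summing the two policies' log-probabilities per candidate action and re-normalizing, though the exact ensembling scheme used in the paper is not specified here.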
Related papers
- Towards Explainable, Safe Autonomous Driving with Language Embeddings for Novelty Identification and Active Learning: Framework and Experimental Analysis with Real-World Data Sets [0.0]
This research explores the integration of language embeddings for active learning in autonomous driving datasets.
Our proposed method employs language-based representations to identify novel scenes, emphasizing the dual purpose of safety takeover responses and active learning.
arXiv Detail & Related papers (2024-02-11T22:53:21Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation [23.94546957057613]
Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN).
We propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks.
arXiv Detail & Related papers (2023-08-24T06:25:20Z)
- Unifying Vision-Language Representation Space with Single-tower Transformer [29.604520441315135]
We train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner.
We discover intriguing properties that distinguish OneR from previous works that learn modality-specific representation spaces.
arXiv Detail & Related papers (2022-11-21T02:34:21Z)
- INTERACTION: A Generative XAI Framework for Natural Language Inference Explanations [58.062003028768636]
Current XAI approaches only focus on delivering a single explanation.
This paper proposes a generative XAI framework, INTERACTION (explaIn aNd predicT thEn queRy with contextuAl CondiTional varIational autO-eNcoder).
Our novel framework presents explanation in two steps: (step one) Explanation and Label Prediction; and (step two) Diverse Evidence Generation.
arXiv Detail & Related papers (2022-09-02T13:52:39Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of a natural language instruction.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
arXiv Detail & Related papers (2021-12-08T06:32:52Z)
- Global-Local Context Network for Person Search [125.51080862575326]
Person search aims to jointly localize and identify a query person from natural, uncropped images.
We exploit rich context information globally and locally surrounding the target person, which we refer to as scene and group context, respectively.
We propose a unified global-local context network (GLCNet) with the intuitive aim of feature enhancement.
arXiv Detail & Related papers (2021-12-05T07:38:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.