Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation
- URL: http://arxiv.org/abs/2302.06072v2
- Date: Thu, 14 Mar 2024 08:09:11 GMT
- Title: Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation
- Authors: Bingqian Lin, Yi Zhu, Xiaodan Liang, Liang Lin, Jianzhuang Liu
- Abstract summary: Actional Atomic-Concept Learning (AACL) maps visual observations to actional atomic concepts to ease their alignment with language instructions.
AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks.
- Score: 124.07372905781696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Navigation (VLN) is a challenging task which requires an agent to align complex visual observations to language instructions to reach the goal position. Most existing VLN agents directly learn to align the raw directional features and visual features trained using one-hot labels to linguistic instruction features. However, the big semantic gap among these multi-modal inputs makes the alignment difficult and therefore limits the navigation performance. In this paper, we propose Actional Atomic-Concept Learning (AACL), which maps visual observations to actional atomic concepts for facilitating the alignment. Specifically, an actional atomic concept is a natural language phrase containing an atomic action and an object, e.g., "go up stairs". These actional atomic concepts, which serve as the bridge between observations and instructions, can effectively mitigate the semantic gap and simplify the alignment. AACL contains three core components: 1) a concept mapping module to map the observations to the actional atomic concept representations through the VLN environment and the recently proposed Contrastive Language-Image Pretraining (CLIP) model, 2) a concept refining adapter to encourage more instruction-oriented object concept extraction by re-ranking the predicted object concepts by CLIP, and 3) an observation co-embedding module which utilizes concept representations to regularize the observation representations. Our AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks. Moreover, the visualization shows that AACL significantly improves the interpretability in action decision.
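The concept mapping and re-ranking ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the angle thresholds, the action vocabulary, the `alpha` blending weight, and the three-dimensional embeddings are all hypothetical stand-ins. In AACL the embeddings would come from CLIP and the refining adapter is learned, whereas here re-ranking is a fixed weighted sum.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def atomic_action(heading_deg, elevation_deg):
    """Map a relative view direction to an atomic action phrase.

    The thresholds and phrases are assumptions made for this sketch.
    """
    if elevation_deg > 15:
        return "go up"
    if elevation_deg < -15:
        return "go down"
    h = (heading_deg + 180) % 360 - 180  # wrap heading to [-180, 180)
    if abs(h) <= 30:
        return "go forward"
    if abs(h) >= 150:
        return "go back"
    return "turn right" if h > 0 else "turn left"


def rank_objects(view_emb, object_embs, instr_emb=None, alpha=0.5):
    """Rank object concepts by similarity to the view embedding.

    If an instruction embedding is given, blend in instruction similarity
    (a fixed-weight stand-in for a learned refining adapter).
    """
    scores = {}
    for name, emb in object_embs.items():
        score = cosine(view_emb, emb)
        if instr_emb is not None:
            score = (1 - alpha) * score + alpha * cosine(instr_emb, emb)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def actional_atomic_concept(heading_deg, elevation_deg, view_emb,
                            object_embs, instr_emb=None):
    """Compose an 'action + object' phrase for one candidate view."""
    top_object = rank_objects(view_emb, object_embs, instr_emb)[0]
    return f"{atomic_action(heading_deg, elevation_deg)} {top_object}"


# Toy embeddings standing in for CLIP text features of object names.
object_embs = {
    "stairs": [1.0, 0.0, 0.0],
    "sofa": [0.0, 1.0, 0.0],
    "door": [0.0, 0.0, 1.0],
}

# A view looking upward at something stair-like.
print(actional_atomic_concept(0, 30, [0.9, 0.1, 0.0], object_embs))  # go up stairs
```

In this sketch, passing an instruction embedding can change which object wins: for an ambiguous view such as `[0.6, 0.5, 0.0]`, "stairs" ranks first on visual similarity alone, while an instruction embedding aligned with "sofa" moves "sofa" to the top.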
Related papers
- CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models [10.62320998365966]
Vision Large Language Models (VLLMs) can inherently recognize and reason about interactions at the image level but are computationally heavy and not designed for instance-level HOI detection.
We propose a Cross-Level HOI distillation (CL-HOI) framework, which distills instance-level HOIs from VLLMs' image-level understanding without the need for manual annotations.
Our approach involves two stages: context distillation, where a Visual Linguistic Translator (VLT) converts visual information into linguistic form, and interaction distillation, where an Interaction Cognition Network (ICN) reasons about spatial, visual, and context relations.
arXiv Detail & Related papers (2024-10-21T05:51:51Z)
- Narrowing the Gap between Vision and Action in Navigation [28.753809306008996]
We introduce a low-level action decoder jointly trained with high-level action prediction.
Our agent can improve navigation performance metrics compared to the strong baselines on both high-level and low-level actions.
arXiv Detail & Related papers (2024-08-19T20:09:56Z)
- A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation [3.809880620207714]
Vision-and-Language Navigation (VLN) is a realistic but challenging task that requires an agent to locate the target region using verbal and visual cues.
This work proposes a dual semantic-aware recurrent global-adaptive network (DSRG) for this task.
arXiv Detail & Related papers (2023-05-05T15:06:08Z)
- Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following [101.55727845195969]
We propose Embodied Concept Learner (ECL) in an interactive 3D environment.
A robot agent can ground visual concepts, build semantic maps and plan actions to complete tasks.
ECL is fully transparent and step-by-step interpretable in long-term planning.
arXiv Detail & Related papers (2023-04-07T17:59:34Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN).
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
arXiv Detail & Related papers (2022-03-10T03:30:12Z)
- Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
arXiv Detail & Related papers (2021-12-08T06:32:52Z)
- Neighbor-view Enhanced Model for Vision and Language Navigation [78.90859474564787]
Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions.
In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) to adaptively incorporate visual contexts from neighbor views.
arXiv Detail & Related papers (2021-07-15T09:11:02Z)