Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2308.12587v1
- Date: Thu, 24 Aug 2023 06:25:20 GMT
- Title: Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation
- Authors: Yibo Cui, Liang Xie, Yakun Zhang, Meishan Zhang, Ye Yan, Erwei Yin
- Abstract summary: Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN).
We propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks.
- Score: 23.94546957057613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal alignment is a key challenge for Vision-and-Language Navigation
(VLN). Most existing studies concentrate on mapping the global instruction or a
single sub-instruction to the corresponding trajectory; however, the equally
critical problem of achieving fine-grained alignment at the entity level is
seldom considered. To address this problem, we propose a novel Grounded
Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To enable
this adaptive pre-training, we first introduce grounded entity-landmark human
annotations into the Room-to-Room (R2R) dataset, forming the GEL-R2R dataset.
We then adopt three grounded entity-landmark adaptive pre-training objectives:
1) entity phrase prediction, 2) landmark bounding box prediction, and
3) entity-landmark semantic alignment, which explicitly supervise the learning
of fine-grained cross-modal alignment between entity phrases and environment
landmarks. Finally, we validate our model on two downstream benchmarks: VLN
with descriptive instructions (R2R) and with dialogue instructions (CVDN).
Comprehensive experiments show that the GELA model achieves state-of-the-art
results on both tasks, demonstrating its effectiveness and generalizability.
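To make the three pre-training objectives named in the abstract concrete, the sketch below shows one plausible way to combine them into a single training loss. It is a minimal illustration only: the function name, tensor shapes, loss choices (cross-entropy, L1 regression, contrastive matching), loss weights, and the temperature value are assumptions for exposition and are not taken from the paper or its code.

```python
# Illustrative sketch of the three GELA pre-training objectives (assumed design,
# not the authors' implementation).
import torch
import torch.nn.functional as F

def gela_pretraining_loss(entity_logits, entity_labels,
                          pred_boxes, gt_boxes,
                          entity_emb, landmark_emb,
                          align_labels, weights=(1.0, 1.0, 1.0)):
    """Combine the three objectives named in the abstract.

    entity_logits: (N, vocab)  predicted entity-phrase tokens
    entity_labels: (N,)        ground-truth token ids
    pred_boxes:    (M, 4)      predicted landmark boxes (x1, y1, x2, y2)
    gt_boxes:      (M, 4)      annotated landmark boxes from GEL-R2R
    entity_emb:    (K, d)      entity phrase embeddings
    landmark_emb:  (K, d)      landmark region embeddings
    align_labels:  (K,)        index of the landmark matching each entity
    """
    # 1) Entity phrase prediction: classify the queried entity tokens.
    loss_ep = F.cross_entropy(entity_logits, entity_labels)

    # 2) Landmark bounding box prediction: regress the annotated boxes.
    loss_bb = F.l1_loss(pred_boxes, gt_boxes)

    # 3) Entity-landmark semantic alignment: contrastive matching between
    #    entity phrase embeddings and landmark region embeddings.
    sim = F.normalize(entity_emb, dim=-1) @ F.normalize(landmark_emb, dim=-1).T
    loss_align = F.cross_entropy(sim / 0.07, align_labels)  # 0.07: assumed temperature

    w_ep, w_bb, w_align = weights
    return w_ep * loss_ep + w_bb * loss_bb + w_align * loss_align
```

Under these assumptions, each term directly supervises one facet of entity-level cross-modal alignment, and the weighted sum would be minimized jointly during adaptive pre-training.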
Related papers
- DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning [40.87681228125296]
Vision-and-Language Navigation (VLN) requires an agent to navigate in an unseen environment by following a natural language instruction.
For task completion, the agent needs to align and integrate various navigation modalities, including the instruction, observations, and navigation history.
arXiv Detail & Related papers (2024-04-02T14:40:04Z)
- Co-guiding for Multi-intent Spoken Language Understanding [53.30511968323911]
We propose a novel model termed Co-guiding Net, which implements a two-stage framework achieving mutual guidance between the two tasks.
For the first stage, we propose single-task supervised contrastive learning, and for the second stage, we propose co-guiding supervised contrastive learning.
Experimental results on multi-intent SLU show that our model outperforms existing models by a large margin.
arXiv Detail & Related papers (2023-11-22T08:06:22Z)
- Ground then Navigate: Language-guided Navigation in Dynamic Scenes [13.870303451896248]
We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings.
We solve the problem by explicitly grounding the navigable regions corresponding to the textual command.
We provide extensive qualitative and quantitative empirical results to validate the efficacy of the proposed approach.
arXiv Detail & Related papers (2022-09-24T09:51:09Z)
- Entity-enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding [214.8003571700285]
Weakly supervised Referring Expression Grounding (REG) aims to ground a particular target in an image described by a language expression.
We design an entity-enhanced adaptive reconstruction network (EARN).
EARN includes three modules: entity enhancement, adaptive grounding, and collaborative reconstruction.
arXiv Detail & Related papers (2022-07-18T05:30:45Z)
- Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN).
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
arXiv Detail & Related papers (2022-03-10T03:30:12Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on the fly to enable efficient exploration in the global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
- Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics [6.678312249123534]
We aim to boost end-to-end models with object-guided statistical priors.
We propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to profit from this object-guided hierarchy.
The above modules combined compose the Object-guided Cross-modal Network (OCN).
arXiv Detail & Related papers (2022-02-01T07:39:04Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of semi-supervised learning (SSL) and domain adaptation (DA).
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
- Grounded Situation Recognition [56.18102368133022]
We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images.
GSR presents important technical challenges: identifying semantic saliency, and categorizing and localizing a large and diverse set of entities.
We show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic aware image retrieval.
arXiv Detail & Related papers (2020-03-26T17:57:52Z)