Prompt-based Context- and Domain-aware Pretraining for Vision and
Language Navigation
- URL: http://arxiv.org/abs/2309.03661v3
- Date: Thu, 14 Dec 2023 10:03:52 GMT
- Title: Prompt-based Context- and Domain-aware Pretraining for Vision and
Language Navigation
- Authors: Ting Liu, Yue Hu, Wansen Wu, Youkai Wang, Kai Xu, Quanjun Yin
- Abstract summary: We propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems.
In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset.
In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics in the instruction.
- Score: 19.793659852435486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained visual-language models have extensive world knowledge and are
widely used in vision-and-language navigation (VLN). However, they are not
sensitive to the indoor scenarios of VLN tasks. Another challenge for VLN is how
the agent understands the contextual relations between actions on a path and
performs cross-modal alignment sequentially. In this paper, we propose a novel
Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address
these problems. It performs prompting in two stages. In the indoor-aware stage,
we apply an efficient tuning paradigm to learn deep visual prompts from an
indoor dataset, in order to augment pretrained models with inductive biases
towards indoor environments. This can enable more sample-efficient adaptation
for VLN agents. Furthermore, in the context-aware stage, we design a set of
hard context prompts to capture the sequence-level semantics in the
instruction. They enable further tuning of the pretrained models via
contrastive learning. Experimental results on both R2R and REVERIE show the
superiority of PANDA compared to existing state-of-the-art methods.
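The abstract describes two prompting stages: deep visual prompts learned on an indoor dataset with the backbone kept frozen, and hard context prompts tuned with a contrastive objective. The sketch below illustrates that general recipe in plain PyTorch; the module names, prompt length, template wording, and the symmetric InfoNCE loss are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage prompting idea (assumptions, not PANDA's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepVisualPromptEncoder(nn.Module):
    """Indoor-aware stage: a frozen ViT-style backbone whose token sequence is
    prepended with learnable prompt tokens at every layer ("deep" visual
    prompts). Only the prompts are updated on the indoor data."""

    def __init__(self, dim=768, depth=12, heads=12, num_prompts=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(depth)]
        )
        # One set of prompt tokens per layer; these are the only trainable weights.
        self.prompts = nn.Parameter(torch.randn(depth, num_prompts, dim) * 0.02)
        for p in self.layers.parameters():
            p.requires_grad = False

    def forward(self, patch_tokens):                     # (B, N, dim)
        b = patch_tokens.size(0)
        x = patch_tokens
        for i, layer in enumerate(self.layers):
            prompt = self.prompts[i].unsqueeze(0).expand(b, -1, -1)
            x = layer(torch.cat([prompt, x], dim=1))
            x = x[:, prompt.size(1):]                    # drop prompts before the next layer
        return x.mean(dim=1)                             # pooled visual embedding


# Context-aware stage: hard (hand-written) prompt templates wrap the instruction
# so the text encoder attends to sequence-level action order. These templates
# are invented examples of such prompts.
HARD_CONTEXT_TEMPLATES = [
    "First {step_1}, then {step_2}, and finally {step_3}.",
    "The actions along the path, in order, are: {step_1}, {step_2}, {step_3}.",
]


def contrastive_loss(instr_emb, traj_emb, temperature=0.07):
    """Symmetric InfoNCE standing in for the contrastive objective: matched
    instruction/trajectory pairs are positives, other in-batch pairs negatives."""
    instr = F.normalize(instr_emb, dim=-1)
    traj = F.normalize(traj_emb, dim=-1)
    logits = instr @ traj.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    enc = DeepVisualPromptEncoder()
    patches = torch.randn(4, 196, 768)     # a batch of ViT patch tokens
    vis = enc(patches)                     # (4, 768)
    txt = torch.randn(4, 768)              # placeholder instruction embeddings
    print(contrastive_loss(txt, vis).item())
```

Keeping the backbone frozen and training only the per-layer prompts is what makes such an indoor-aware stage sample-efficient: the trainable parameter count is depth x num_prompts x dim rather than the full encoder.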
Related papers
- Advancing Prompt Learning through an External Layer [24.77977865016954]
We propose a paradigm called EnPrompt with a novel External Layer (EnLa).
The learnable external layer is built upon valid embeddings of pre-trained CLIP.
Four experiments demonstrate that our method outperforms the existing prompt learning method.
arXiv Detail & Related papers (2024-07-29T03:30:09Z)
- DAP: Domain-aware Prompt Learning for Vision-and-Language Navigation [19.793659852435486]
We propose a novel and model-agnostic domain-aware prompt learning (DAP) framework for VLN tasks.
DAP applies a low-cost prompt tuning paradigm to learn soft visual prompts for extracting in-domain image semantics.
Experimental results on both R2R and REVERIE show the superiority of DAP compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2023-11-29T17:03:37Z)
- Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment [15.180715595425864]
We introduce a novel method to improve the prompt learning of vision-language models by incorporating pre-trained large language models (LLMs).
With DuAl-PT, we propose to learn more context-aware prompts, benefiting from both explicit and implicit context modeling.
Empirically, DuAl-PT achieves superior performance on 11 downstream datasets for few-shot recognition and base-to-new generalization.
arXiv Detail & Related papers (2023-09-08T06:51:15Z)
- Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation [70.76686546473994]
We introduce a novel speaker model, KEFA, for navigation instruction generation.
The proposed KEFA speaker achieves state-of-the-art instruction generation performance for both indoor and outdoor scenes.
arXiv Detail & Related papers (2023-07-25T09:39:59Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD enables faster adaptation to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts [92.92047324641622]
We propose modAlity-aligneD Action PrompTs (ADAPT) for Vision-Language Navigation (VLN).
ADAPT provides the VLN agent with action prompts to enable the explicit learning of action-level modality alignment.
Experimental results on both R2R and RxR show the superiority of ADAPT over state-of-the-art methods.
arXiv Detail & Related papers (2022-05-31T02:41:31Z)
- Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large number of image-text-action triplets in a self-supervised manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)