ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
- URL: http://arxiv.org/abs/2205.15509v1
- Date: Tue, 31 May 2022 02:41:31 GMT
- Title: ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
- Authors: Bingqian Lin, Yi Zhu, Zicong Chen, Xiwen Liang, Jianzhuang Liu,
Xiaodan Liang
- Abstract summary: We propose modAlity-aligneD Action PrompTs (ADAPT) for Vision-Language Navigation (VLN).
ADAPT provides the VLN agent with action prompts to enable the explicit learning of action-level modality alignment.
Experimental results on both R2R and RxR show the superiority of ADAPT over state-of-the-art methods.
- Score: 92.92047324641622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Navigation (VLN) is a challenging task that requires an
embodied agent to perform action-level modality alignment, i.e., to carry out
the actions an instruction asks for, in sequence, within complex visual
environments. Most existing VLN agents learn directly from instruction-path
data and cannot sufficiently exploit the action-level alignment knowledge
inside the multi-modal
inputs. In this paper, we propose modAlity-aligneD Action PrompTs (ADAPT),
which provides the VLN agent with action prompts to enable the explicit
learning of action-level modality alignment to pursue successful navigation.
Specifically, an action prompt is defined as a modality-aligned pair of an
image sub-prompt and a text sub-prompt, where the former is a single-view
observation and the latter is a phrase like "walk past the chair". When
starting navigation, the instruction-related action prompt set is retrieved
from a pre-built action prompt base and passed through a prompt encoder to
obtain the prompt feature. Then the prompt feature is concatenated with the
original instruction feature and fed to a multi-layer transformer for action
prediction. To collect high-quality action prompts into the prompt base, we use
the Contrastive Language-Image Pretraining (CLIP) model, which has powerful
cross-modality alignment ability. A modality alignment loss and a sequential
consistency loss are further introduced to strengthen the alignment within each
action prompt pair and to make the agent attend to the related prompts in sequence.
Experimental results on both R2R and RxR show the superiority of ADAPT over
state-of-the-art methods.
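As a rough illustration of the pipeline the abstract describes (retrieve instruction-related action prompts from a pre-built base, encode them, concatenate the prompt feature with the instruction feature, and feed the result to a multi-layer transformer), here is a minimal PyTorch sketch. The class and function names, feature dimensions, and the exact-match retrieval are assumptions made for this example; the paper builds and matches the prompt base with CLIP and trains with two additional losses that are not implemented here.

```python
# Minimal, hypothetical sketch of the ADAPT flow described in the abstract.
# Names, dimensions, and exact-match retrieval are illustrative assumptions,
# not the authors' implementation.

import torch
import torch.nn as nn


class PromptEncoder(nn.Module):
    """Encodes (image sub-prompt, text sub-prompt) pairs into prompt features."""

    def __init__(self, img_dim=512, txt_dim=512, hidden=768):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, img_feats, txt_feats):
        # img_feats / txt_feats: (num_prompts, dim) CLIP-style features
        pair = torch.cat([self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=-1)
        return self.fuse(pair)  # (num_prompts, hidden)


def retrieve_action_prompts(action_phrases, prompt_base):
    """Look up the prompt pair for each action phrase found in the instruction.

    `prompt_base` maps a phrase such as "walk past the chair" to an
    (image_feature, text_feature) pair collected offline with CLIP.
    Exact lookup is used here for brevity; the paper scores candidates
    with CLIP's cross-modal similarity instead.
    """
    pairs = [prompt_base[p] for p in action_phrases if p in prompt_base]
    img = torch.stack([p[0] for p in pairs])
    txt = torch.stack([p[1] for p in pairs])
    return img, txt


# --- toy usage with random CLIP-sized (512-d) features ---
prompt_base = {
    "walk past the chair": (torch.randn(512), torch.randn(512)),
    "turn left at the stairs": (torch.randn(512), torch.randn(512)),
}
img_feats, txt_feats = retrieve_action_prompts(
    ["walk past the chair", "turn left at the stairs"], prompt_base
)

prompt_feature = PromptEncoder()(img_feats, txt_feats)        # (2, 768)
instruction_feature = torch.randn(20, 768)                    # instruction token features
fused = torch.cat([instruction_feature, prompt_feature], 0)   # concatenated sequence

# The fused sequence is then fed to a multi-layer transformer for action
# prediction (a 2-layer stand-in below); the modality alignment and sequential
# consistency losses from the paper are not shown.
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
action_states = nn.TransformerEncoder(encoder_layer, num_layers=2)(fused.unsqueeze(0))
print(action_states.shape)  # torch.Size([1, 22, 768])
```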
Related papers
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled at diverse tasks specified by human commands is a long-term goal of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z)
- I2EDL: Interactive Instruction Error Detection and Localization [65.25839671641218]
We propose a novel task, Interactive VLN in Continuous Environments (IVLN-CE).
It allows the agent to interact with the user during VLN-CE navigation to clarify any doubts about instruction errors.
We leverage a pre-trained module to detect instruction errors and pinpoint them in the instruction by cross-referencing the textual input and past observations.
arXiv Detail & Related papers (2024-06-07T16:52:57Z)
- Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts [37.20272055902246]
Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP) is a novel task augmenting traditional VLN by integrating both natural language and images in instructions.
VLN-MP not only maintains backward compatibility by effectively handling text-only prompts but also consistently shows advantages with different quantities and relevance of visual prompts.
arXiv Detail & Related papers (2024-06-04T11:06:13Z)
- APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning [15.844451999840588]
We propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe), which tunes the prompts of both modalities, vision and language, as tokens in a sequential manner.
APLe shows robustness and favourable performance in prompt-length experiments with an absolute advantage in adopting the V-L models.
arXiv Detail & Related papers (2024-01-12T04:54:01Z)
- Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation [19.793659852435486]
We propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems.
In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset.
In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics in the instruction.
arXiv Detail & Related papers (2023-09-07T11:58:34Z)
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- MLANet: Multi-Level Attention Network with Sub-instruction for Continuous Vision-and-Language Navigation [6.478089983471946]
Vision-and-Language Navigation (VLN) aims to develop intelligent agents to navigate in unseen environments only through language and vision supervision.
In the recently proposed continuous settings (continuous VLN), the agent must act in a free 3D space and faces tougher challenges like real-time execution, complex instruction understanding, and long action sequence prediction.
For better performance in continuous VLN, we design a multi-level instruction understanding procedure and propose a novel model, the Multi-Level Attention Network (MLANet).
arXiv Detail & Related papers (2023-03-02T16:26:14Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
arXiv Detail & Related papers (2022-03-30T18:15:26Z)