MLANet: Multi-Level Attention Network with Sub-instruction for
Continuous Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2303.01396v1
- Date: Thu, 2 Mar 2023 16:26:14 GMT
- Title: MLANet: Multi-Level Attention Network with Sub-instruction for
Continuous Vision-and-Language Navigation
- Authors: Zongtao He, Liuyi Wang, Shu Li, Qingqing Yan, Chengju Liu and Qijun
Chen
- Abstract summary: Vision-and-Language Navigation (VLN) aims to develop intelligent agents to navigate in unseen environments only through language and vision supervision.
In the recently proposed continuous settings (continuous VLN), the agent must act in a free 3D space and faces tougher challenges like real-time execution, complex instruction understanding, and long action sequence prediction.
For better performance in continuous VLN, we design a multi-level instruction understanding procedure and propose a novel model, Multi-Level Attention Network (MLANet).
- Score: 6.478089983471946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) aims to develop intelligent agents to
navigate in unseen environments only through language and vision supervision.
In the recently proposed continuous settings (continuous VLN), the agent must
act in a free 3D space and faces tougher challenges like real-time execution,
complex instruction understanding, and long action sequence prediction. For
better performance in continuous VLN, we design a multi-level instruction
understanding procedure and propose a novel model, Multi-Level Attention
Network (MLANet). The first step of MLANet is to generate sub-instructions
efficiently. We design a Fast Sub-instruction Algorithm (FSA) to segment the
raw instruction into sub-instructions and generate a new sub-instruction
dataset named "FSASub". FSA is annotation-free and 70 times faster than the
current method, thus fitting the real-time requirement in continuous VLN.
To solve the complex instruction understanding problem, MLANet needs a global
perception of the instruction and observations. We propose a Multi-Level
Attention (MLA) module to fuse vision, low-level semantics, and high-level
semantics, which produces features containing a dynamic and global comprehension
of the task. MLA also mitigates the adverse effects of noise words, thus
ensuring a robust understanding of the instruction. To correctly predict
actions in long trajectories, MLANet needs to focus on which sub-instruction is
being executed at every step. We propose a Peak Attention Loss (PAL) to improve
the flexible and adaptive selection of the current sub-instruction. PAL
benefits the navigation agent by concentrating its attention on the local
information, thus helping the agent predict the most appropriate actions. We
train and test MLANet on the standard benchmark. Experimental results show that
MLANet outperforms baselines by a significant margin.
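The abstract does not describe how FSA segments an instruction, so the following is only a minimal rule-based sketch of what annotation-free sub-instruction splitting could look like. The boundary cues (punctuation, "and", "then"), the function name `split_instruction`, and the `min_words` parameter are all assumptions made for illustration; this is not the authors' algorithm.

```python
import re

# Hypothetical boundary cues: the abstract does not say which cues FSA uses.
_BOUNDARY = re.compile(
    r"\s*(?:[.;,]|\band then\b|\bthen\b|\band\b)\s*", re.IGNORECASE
)


def split_instruction(instruction: str, min_words: int = 2) -> list[str]:
    """Split a navigation instruction into rough sub-instructions.

    Fragments shorter than `min_words` are merged into the previous
    sub-instruction so that stray single words do not stand alone.
    """
    subs: list[str] = []
    for piece in _BOUNDARY.split(instruction):
        piece = piece.strip()
        if not piece:
            continue  # adjacent boundaries produce empty fragments
        if subs and len(piece.split()) < min_words:
            subs[-1] = f"{subs[-1]} {piece}"
        else:
            subs.append(piece)
    return subs


if __name__ == "__main__":
    text = "Walk past the sofa and turn left. Then go up the stairs, and stop at the door."
    for i, sub in enumerate(split_instruction(text)):
        print(i, sub)
```

A purely rule-based splitter like this needs no annotated data and runs in time linear in the instruction length, which is the kind of property the abstract attributes to FSA. The actual segmentation rules may differ substantially.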
Related papers
- Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy [37.471419716572086]
There is a significant gap in instruction-following capabilities between Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
We propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap.
arXiv Detail & Related papers (2024-11-23T05:03:32Z) - Neurosymbolic AI for Enhancing Instructability in Generative AI [7.4348066967005275]
Generative AI has transformed content creation across text, images, and music, showcasing capabilities in following instructions through prompting.
This article explores why neurosymbolic AI offers a better path to enhance the instructability of Large Language Models (LLMs).
We show that a neurosymbolic approach enhances the reliability and context-awareness of task execution, enabling LLMs to dynamically interpret and respond to a wider range of instructional contexts with greater precision and flexibility.
arXiv Detail & Related papers (2024-07-26T13:15:50Z) - INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning [59.07490387145391]
Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks.
Their application to information retrieval (IR) tasks is still challenging due to the infrequent occurrence of many IR-specific concepts in natural language.
We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories.
arXiv Detail & Related papers (2024-01-12T12:10:28Z) - Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation.
We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation.
We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z) - Instruction Position Matters in Sequence Generation with Large Language
Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
arXiv Detail & Related papers (2023-08-23T12:36:57Z) - Adapting Pre-trained Language Models to Vision-Language Tasks via
Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z) - Plan, Eliminate, and Track -- Language Models are Good Teachers for
Embodied Agents [99.17668730578586]
Pre-trained large language models (LLMs) capture procedural knowledge about the world.
The Plan, Eliminate, and Track (PET) framework translates a task description into a list of high-level sub-tasks.
The PET framework leads to a significant 15% improvement over SOTA for generalization to human goal specifications.
arXiv Detail & Related papers (2023-05-03T20:11:22Z) - Boosting Natural Language Generation from Instructions with
Meta-Learning [43.64522457686405]
Recent work has shown that language models (LMs) trained with multi-task instructional learning (MTIL) can solve diverse NLP tasks with improved performance compared to prompt tuning.
In this paper we investigate whether meta-learning applied to MTIL can further improve generalization to unseen tasks in a zero-shot setting.
arXiv Detail & Related papers (2022-10-20T22:23:23Z) - ULN: Towards Underspecified Vision-and-Language Navigation [77.81257404252132]
Underspecified Vision-and-Language Navigation (ULN) is a new setting for Vision-and-Language Navigation (VLN).
We propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module.
Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.
arXiv Detail & Related papers (2022-10-18T17:45:06Z)