Visual-and-Language Navigation: A Survey and Taxonomy
- URL: http://arxiv.org/abs/2108.11544v1
- Date: Thu, 26 Aug 2021 01:51:18 GMT
- Title: Visual-and-Language Navigation: A Survey and Taxonomy
- Authors: Wansen Wu, Tao Chang, Xinmeng Li
- Abstract summary: This paper provides a comprehensive survey on Visual-and-Language Navigation (VLN) tasks.
According to when the instructions are given, the tasks can be divided into single-turn and multi-turn.
This taxonomy enables researchers to better grasp the key points of a specific task and identify directions for future research.
- Score: 1.0742675209112622
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An agent that can understand natural-language instructions and carry out
corresponding actions in the visual world is one of the long-term challenges of
Artificial Intelligence (AI). Because instructions from humans are multifarious, the
agent must be able to link natural language to vision and action in unstructured,
previously unseen environments. If the instruction given by a human is a navigation
task, this challenge is called Visual-and-Language Navigation (VLN). It is a booming
multi-disciplinary field of increasing importance and with extraordinary practicality.
Instead of focusing on the details of specific methods, this paper provides a
comprehensive survey on VLN tasks and carefully classifies them according to the
different characteristics of the language instructions in these tasks. According to
when the instructions are given, the tasks can be divided into single-turn and
multi-turn. Single-turn tasks are further divided into goal-oriented and
route-oriented, based on whether the instructions contain a route. Multi-turn tasks
are divided into imperative and interactive tasks, based on whether the agent responds
to the instructions. This taxonomy enables researchers to better grasp the key points
of a specific task and identify directions for future research.
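To make the taxonomy above concrete, the following is a minimal sketch of how it could be encoded in Python. The class and field names, as well as the example task assignments, are illustrative assumptions for this survey's classification criteria and are not taken from the paper itself.

```python
# Illustrative sketch of the survey's taxonomy; names and example
# classifications are assumptions for demonstration, not from the paper.
from dataclasses import dataclass
from enum import Enum, auto


class TurnType(Enum):
    SINGLE_TURN = auto()  # all instructions are given before execution starts
    MULTI_TURN = auto()   # instructions keep arriving during execution


class SingleTurnSubtype(Enum):
    GOAL_ORIENTED = auto()   # instruction specifies only the goal
    ROUTE_ORIENTED = auto()  # instruction also describes a route to follow


class MultiTurnSubtype(Enum):
    IMPERATIVE = auto()   # agent only follows the instructions
    INTERACTIVE = auto()  # agent also responds, e.g. by asking questions


@dataclass
class VLNTask:
    name: str
    turn_type: TurnType
    subtype: Enum  # SingleTurnSubtype or MultiTurnSubtype, matching turn_type


# Hypothetical example classifications under this taxonomy:
room_to_room = VLNTask("Room-to-Room", TurnType.SINGLE_TURN,
                       SingleTurnSubtype.ROUTE_ORIENTED)
dialog_nav = VLNTask("Vision-and-Dialog Navigation", TurnType.MULTI_TURN,
                     MultiTurnSubtype.INTERACTIVE)
```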
Related papers
- An Incomplete Loop: Deductive, Inductive, and Abductive Learning in Large Language Models [99.31449616860291]
Modern language models (LMs) can learn to perform new tasks in different ways.
In instruction following, the target task is described explicitly in natural language; in few-shot prompting, the task is specified implicitly.
In instruction inference, LMs are presented with in-context examples and are then prompted to generate a natural language task description.
arXiv Detail & Related papers (2024-04-03T19:31:56Z) - NaturalVLM: Leveraging Fine-grained Natural Language for
Affordance-Guided Visual Manipulation [21.02437461550044]
Many real-world tasks demand intricate multi-step reasoning.
We introduce a benchmark, NrVLM, comprising 15 distinct manipulation tasks.
We propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions.
arXiv Detail & Related papers (2024-03-13T09:12:16Z) - Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction [22.31940101833938]
This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions.
We construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer.
arXiv Detail & Related papers (2024-02-06T17:09:25Z) - $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting
Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate by following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z) - LINGO : Visually Debiasing Natural Language Instructions to Support Task
Diversity [11.44413929033824]
We develop LINGO, a novel visual analytics interface that supports an effective, task-driven workflow.
We conduct a user study with both novice and expert instruction creators, over a dataset of 1,616 linguistic tasks and their natural language instructions.
For both user groups, LINGO promotes the creation of tasks that are more difficult for pre-trained models, with higher linguistic diversity and lower instruction bias.
arXiv Detail & Related papers (2023-04-12T22:55:52Z) - Lana: A Language-Capable Navigator for Instruction Following and
Generation [70.76686546473994]
LANA is a language-capable navigation agent which is able to execute human-written navigation commands and provide route descriptions to humans.
We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performance on both instruction following and route description.
In addition, endowed with language generation capability, LANA can explain its behaviors to humans and assist human wayfinding.
arXiv Detail & Related papers (2023-03-15T07:21:28Z) - Robustness of Learning from Task Instructions [15.462970803323563]
Traditional supervised learning mostly works on individual tasks and requires training on a large set of task-specific examples.
To build a system that can quickly and easily generalize to new tasks, task instructions have been adopted as an emerging trend of supervision.
This work investigates the system robustness when the instructions of new tasks are (i) manipulated, (ii) paraphrased, or (iii) from different levels of conciseness.
arXiv Detail & Related papers (2022-12-07T17:54:59Z) - Fast Inference and Transfer of Compositional Task Structures for
Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z) - Counterfactual Cycle-Consistent Learning for Instruction Following and
Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
arXiv Detail & Related papers (2022-03-30T18:15:26Z) - Improving Cross-Modal Alignment in Vision Language Navigation via
Syntactic Information [83.62098382773266]
Vision language navigation is the task that requires an agent to navigate through a 3D environment based on natural language instructions.
We propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes.
Our agent achieves the new state-of-the-art on the Room-Across-Room dataset, which contains instructions in three languages.
arXiv Detail & Related papers (2021-04-19T19:18:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.