ULN: Towards Underspecified Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2210.10020v1
- Date: Tue, 18 Oct 2022 17:45:06 GMT
- Title: ULN: Towards Underspecified Vision-and-Language Navigation
- Authors: Weixi Feng, Tsu-Jui Fu, Yujie Lu, William Yang Wang
- Abstract summary: Underspecified Vision-and-Language Navigation (ULN) is a new setting for Vision-and-Language Navigation (VLN).
We propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module.
Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.
- Score: 77.81257404252132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) is a task to guide an embodied agent
moving to a target position using language instructions. Despite significant
performance improvements, the wide use of fine-grained instructions fails to
characterize the more practical linguistic variations found in reality. To fill
this gap, we introduce a new setting, namely Underspecified
Vision-and-Language Navigation (ULN), and associated evaluation datasets. ULN
evaluates agents using multi-level underspecified instructions instead of
purely fine-grained or coarse-grained, which is a more realistic and general
setting. As a primary step toward ULN, we propose a VLN framework that consists
of a classification module, a navigation agent, and an
Exploitation-to-Exploration (E2E) module. Specifically, we propose to learn
Granularity Specific Sub-networks (GSS) for the agent to ground multi-level
instructions with minimal additional parameters. Then, our E2E module estimates
grounding uncertainty and conducts multi-step lookahead exploration to improve
the success rate further. Experimental results show that existing VLN models
are still brittle to multi-level language underspecification. Our framework is
more robust and outperforms the baselines on ULN by ~10% relative success rate
across all levels.
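To make the pipeline in the abstract more concrete, the sketch below wires together a granularity classification module, a navigation agent with Granularity Specific Sub-networks (GSS), and an entropy-based Exploitation-to-Exploration (E2E) trigger. This is a minimal, hypothetical reconstruction assuming a PyTorch-style agent: all class names, tensor shapes, the number of granularity levels, and the uncertainty threshold (GranularityClassifier, GSSAgent, E2EModule, NUM_LEVELS, entropy_threshold) are illustrative assumptions, not the paper's actual implementation or API.

```python
# Minimal, hypothetical sketch of the ULN framework described above.
# All names, shapes, and thresholds are illustrative assumptions,
# not the paper's released code or API.
import torch
import torch.nn as nn

NUM_LEVELS = 3  # assumed granularity levels, e.g. fine-grained / partial / coarse-grained


class GranularityClassifier(nn.Module):
    """Classification module: predicts the underspecification level of an instruction."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.head = nn.Linear(hidden_dim, NUM_LEVELS)

    def forward(self, instr_feat: torch.Tensor) -> torch.Tensor:
        # instr_feat: (batch, hidden_dim) pooled instruction encoding
        return self.head(instr_feat).softmax(dim=-1)


class GSSAgent(nn.Module):
    """Navigation agent with Granularity Specific Sub-networks (GSS):
    a shared fusion layer plus one lightweight head per granularity level,
    so multi-level instructions are grounded with few extra parameters."""

    def __init__(self, hidden_dim: int = 768, num_candidates: int = 8):
        super().__init__()
        self.shared = nn.Linear(hidden_dim * 2, hidden_dim)
        self.level_heads = nn.ModuleList(
            nn.Linear(hidden_dim, num_candidates) for _ in range(NUM_LEVELS)
        )

    def forward(self, instr_feat, obs_feat, level: int) -> torch.Tensor:
        fused = torch.tanh(self.shared(torch.cat([instr_feat, obs_feat], dim=-1)))
        return self.level_heads[level](fused).softmax(dim=-1)  # action distribution


class E2EModule:
    """Exploitation-to-Exploration: if grounding uncertainty (entropy of the
    action distribution) is high, switch from greedy exploitation to a
    multi-step lookahead exploration phase."""

    def __init__(self, entropy_threshold: float = 1.0, lookahead_steps: int = 2):
        self.entropy_threshold = entropy_threshold
        self.lookahead_steps = lookahead_steps

    def should_explore(self, action_probs: torch.Tensor) -> bool:
        entropy = -(action_probs * action_probs.clamp_min(1e-8).log()).sum(-1)
        return bool(entropy.mean() > self.entropy_threshold)


if __name__ == "__main__":
    instr = torch.randn(1, 768)  # placeholder instruction feature
    obs = torch.randn(1, 768)    # placeholder panoramic observation feature
    classifier, agent, e2e = GranularityClassifier(), GSSAgent(), E2EModule()

    level = int(classifier(instr).argmax(dim=-1))   # 1) predict instruction granularity
    action_probs = agent(instr, obs, level)         # 2) ground with the matching sub-network
    if e2e.should_explore(action_probs):            # 3) explore only when uncertain
        print(f"high uncertainty: run {e2e.lookahead_steps}-step lookahead exploration")
    else:
        print("low uncertainty: exploit top action", int(action_probs.argmax(-1)))
```

In the full framework the exploration branch would roll candidate paths forward for several steps and return to the most promising one; the snippet only illustrates the uncertainty-based decision point between exploitation and exploration.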
Related papers
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current state of the art in IVLN techniques.
To fully exploit the interpreted navigation data, we introduce a structured representation, called Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z)
- Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation [65.25839671641218]
We propose a novel benchmark dataset that introduces various types of instruction errors considering potential human causes.
We observe a noticeable performance drop (up to -25%) in Success Rate when evaluating the state-of-the-art VLN-CE methods on our benchmark.
We also propose an effective method, based on a cross-modal transformer architecture, that achieves the best performance in error detection and localization.
arXiv Detail & Related papers (2024-03-15T21:36:15Z)
- TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs).
We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception.
Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
arXiv Detail & Related papers (2024-03-13T05:22:39Z)
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), in which we perform parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- MLANet: Multi-Level Attention Network with Sub-instruction for Continuous Vision-and-Language Navigation [6.478089983471946]
Vision-and-Language Navigation (VLN) aims to develop intelligent agents to navigate in unseen environments only through language and vision supervision.
In the recently proposed continuous setting (continuous VLN), the agent must act in a free 3D space and face tougher challenges such as real-time execution, complex instruction understanding, and long action sequence prediction.
For better performance in continuous VLN, we design a multi-level instruction understanding procedure and propose a novel model, the Multi-Level Attention Network (MLANet).
arXiv Detail & Related papers (2023-03-02T16:26:14Z)
- Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)
- Soft Expert Reward Learning for Vision-and-Language Navigation [94.86954695912125]
Vision-and-Language Navigation (VLN) requires an agent to find a specified spot in an unseen environment by following natural language instructions.
We introduce a Soft Expert Reward Learning (SERL) model to overcome the reward engineering and generalisation problems of the VLN task.
arXiv Detail & Related papers (2020-07-21T14:17:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.