Towards Versatile Embodied Navigation
- URL: http://arxiv.org/abs/2210.16822v1
- Date: Sun, 30 Oct 2022 11:53:49 GMT
- Title: Towards Versatile Embodied Navigation
- Authors: Hanqing Wang, Wei Liang, Luc Van Gool, Wenguan Wang
- Abstract summary: Vienna is a versatile embodied navigation agent that simultaneously learns to perform the four navigation tasks with one model.
We empirically demonstrate that, compared with learning each visual navigation task individually, our agent achieves comparable or even better performance with reduced complexity.
- Score: 120.73460380993305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the emergence of varied visual navigation tasks (e.g.,
image-/object-/audio-goal and vision-language navigation) that specify the
target in different ways, the community has made appealing advances in training
specialized agents capable of handling individual navigation tasks well. Given
plenty of embodied navigation tasks and task-specific solutions, we address a
more fundamental question: can we learn a single powerful agent that masters
not one but multiple navigation tasks concurrently? First, we propose VXN, a
large-scale 3D dataset that instantiates four classic navigation tasks in
standardized, continuous, and audiovisual-rich environments. Second, we propose
Vienna, a versatile embodied navigation agent that simultaneously learns to
perform the four navigation tasks with one model. Building upon a
full-attentive architecture, Vienna formulates various navigation tasks as a
unified, parse-and-query procedure: the target description, augmented with four
task embeddings, is comprehensively interpreted into a set of diversified goal
vectors, which are refined as the navigation progresses, and used as queries to
retrieve supportive context from episodic history for decision making. This
enables the reuse of knowledge across navigation tasks with varying input
domains/modalities. We empirically demonstrate that, compared with learning
each visual navigation task individually, our multitask agent achieves
comparable or even better performance with reduced complexity.
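As a rough illustration of the parse-and-query procedure described in the abstract, here is a minimal, hypothetical PyTorch-style sketch. The module names, dimensions, and exact wiring are assumptions made for this example and are not the authors' released implementation: learned goal seeds attend over the task-embedding-augmented target description ("parse"), and the resulting goal vectors then query the episodic history to produce action logits ("query").

```python
# Hypothetical sketch of a parse-and-query navigation policy (not the authors' code).
import torch
import torch.nn as nn

class ParseAndQueryPolicy(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_tasks=4, n_goals=4, n_actions=4):
        super().__init__()
        # One learned embedding per task (image-/object-/audio-goal, vision-language).
        self.task_embed = nn.Embedding(n_tasks, d_model)
        # Learned seeds that are parsed into diversified goal vectors.
        self.goal_seeds = nn.Parameter(torch.randn(n_goals, d_model))
        # "Parse": goal seeds attend over the task-augmented target description.
        self.parse_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # "Query": goal vectors retrieve supportive context from episodic history.
        self.query_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.policy_head = nn.Linear(d_model, n_actions)

    def forward(self, target_tokens, task_id, episodic_history):
        # target_tokens:    (B, T, d) encoded target description (any modality)
        # task_id:          (B,)      integer id of the current navigation task
        # episodic_history: (B, H, d) encoded observations seen so far this episode
        B = target_tokens.size(0)
        task = self.task_embed(task_id).unsqueeze(1)            # (B, 1, d)
        target = torch.cat([target_tokens, task], dim=1)        # augment with task embedding
        seeds = self.goal_seeds.unsqueeze(0).expand(B, -1, -1)  # (B, G, d)
        goals, _ = self.parse_attn(seeds, target, target)       # diversified goal vectors
        context, _ = self.query_attn(goals, episodic_history, episodic_history)
        return self.policy_head(context.mean(dim=1))            # (B, n_actions) action logits
```

In this sketch the goal vectors are computed once per step; the abstract's refinement of goal vectors as navigation progresses (e.g., recurrent updating across steps) is omitted for brevity.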
Related papers
- Towards Learning a Generalist Model for Embodied Navigation [24.816490551945435]
We propose the first generalist model for embodied navigation, NaviLLM.
It adapts LLMs to embodied navigation by introducing schema-based instruction.
We conduct extensive experiments to evaluate the performance and generalizability of our model.
arXiv Detail & Related papers (2023-12-04T16:32:51Z)
- Multi-goal Audio-visual Navigation using Sound Direction Map [10.152838128195468]
We propose a new framework for multi-goal audio-visual navigation.
The research shows that multi-goal audio-visual navigation is difficult because it implicitly requires the agent to separate multiple sound sources.
We propose a method named sound direction map (SDM), which dynamically localizes multiple sound sources in a learning-based manner.
arXiv Detail & Related papers (2023-08-01T01:26:55Z)
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192]
We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event by navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
arXiv Detail & Related papers (2022-10-14T16:35:06Z)
- Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z)
- Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies show a slowdown in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)
- MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation [23.877609358505268]
Recent work shows that map-like memory is useful for long-horizon navigation tasks.
We propose the multiON task, which requires navigation to an episode-specific sequence of objects in a realistic environment.
We examine how a variety of agent models perform across a spectrum of navigation task complexities.
arXiv Detail & Related papers (2020-12-07T18:42:38Z)
- Active Visual Information Gathering for Vision-Language Navigation [115.40768457718325]
Vision-language navigation (VLN) is the task of requiring an agent to carry out navigational instructions inside photo-realistic environments.
One of the key challenges in VLN is how to conduct a robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment.
This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent VLN policy.
arXiv Detail & Related papers (2020-07-15T23:54:20Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines on both SR (success rate) and SPL metrics; a minimal SPL sketch follows this list.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
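For reference, the SR and SPL metrics mentioned in the last entry above are the standard embodied-navigation measures of success rate and Success weighted by Path Length (Anderson et al., 2018). A minimal sketch of the SPL computation, with hypothetical variable names:

```python
# Minimal sketch of SPL (Success weighted by Path Length), the metric referenced above.
def spl(successes, shortest_lengths, actual_lengths):
    """successes: 0/1 per episode; shortest_lengths: geodesic distance to the goal;
    actual_lengths: distance actually traversed by the agent."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * l / max(p, l)  # longer-than-optimal paths are penalized
    return total / len(successes)

# Example: two episodes, one success along a near-optimal path, one failure.
print(spl([1, 0], [5.0, 4.0], [6.0, 9.0]))  # about 0.42
```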