Towards Versatile Embodied Navigation
- URL: http://arxiv.org/abs/2210.16822v1
- Date: Sun, 30 Oct 2022 11:53:49 GMT
- Title: Towards Versatile Embodied Navigation
- Authors: Hanqing Wang, Wei Liang, Luc Van Gool, Wenguan Wang
- Abstract summary: Vienna is a versatile embodied navigation agent that simultaneously learns to perform the four navigation tasks with one model.
We empirically demonstrate that, compared with learning each visual navigation task individually, our agent achieves comparable or even better performance with reduced complexity.
- Score: 120.73460380993305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the emergence of varied visual navigation tasks (e.g.,
image-/object-/audio-goal and vision-language navigation) that specify the
target in different ways, the community has made appealing advances in training
specialized agents capable of handling individual navigation tasks well. Given
plenty of embodied navigation tasks and task-specific solutions, we address a
more fundamental question: can we learn a single powerful agent that masters
not one but multiple navigation tasks concurrently? First, we propose VXN, a
large-scale 3D dataset that instantiates four classic navigation tasks in
standardized, continuous, and audiovisual-rich environments. Second, we propose
Vienna, a versatile embodied navigation agent that simultaneously learns to
perform the four navigation tasks with one model. Building upon a
full-attentive architecture, Vienna formulates various navigation tasks as a
unified, parse-and-query procedure: the target description, augmented with four
task embeddings, is comprehensively interpreted into a set of diversified goal
vectors, which are refined as the navigation progresses, and used as queries to
retrieve supportive context from episodic history for decision making. This
enables the reuse of knowledge across navigation tasks with varying input
domains/modalities. We empirically demonstrate that, compared with learning
each visual navigation task individually, our multitask agent achieves
comparable or even better performance with reduced complexity.
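As a rough illustration of the parse-and-query procedure described in the abstract, here is a minimal, hypothetical PyTorch-style sketch. The module names, dimensions, and exact wiring are assumptions made for this example and are not the authors' released implementation: learned goal seeds attend over the task-embedding-augmented target description ("parse"), and the resulting goal vectors then query the episodic history to produce action logits ("query").

```python
# Hypothetical sketch of a parse-and-query navigation policy (not the authors' code).
import torch
import torch.nn as nn

class ParseAndQueryPolicy(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_tasks=4, n_goals=4, n_actions=4):
        super().__init__()
        # One learned embedding per task (image-/object-/audio-goal, vision-language).
        self.task_embed = nn.Embedding(n_tasks, d_model)
        # Learned seeds that are parsed into diversified goal vectors.
        self.goal_seeds = nn.Parameter(torch.randn(n_goals, d_model))
        # "Parse": goal seeds attend over the task-augmented target description.
        self.parse_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # "Query": goal vectors retrieve supportive context from episodic history.
        self.query_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.policy_head = nn.Linear(d_model, n_actions)

    def forward(self, target_tokens, task_id, episodic_history):
        # target_tokens:    (B, T, d) encoded target description (any modality)
        # task_id:          (B,)      integer id of the current navigation task
        # episodic_history: (B, H, d) encoded observations seen so far this episode
        B = target_tokens.size(0)
        task = self.task_embed(task_id).unsqueeze(1)            # (B, 1, d)
        target = torch.cat([target_tokens, task], dim=1)        # augment with task embedding
        seeds = self.goal_seeds.unsqueeze(0).expand(B, -1, -1)  # (B, G, d)
        goals, _ = self.parse_attn(seeds, target, target)       # diversified goal vectors
        context, _ = self.query_attn(goals, episodic_history, episodic_history)
        return self.policy_head(context.mean(dim=1))            # (B, n_actions) action logits
```

In this sketch the goal vectors are computed once per step; the abstract's refinement of goal vectors as navigation progresses (e.g., recurrent updating across steps) is omitted for brevity.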
Related papers
- Towards Learning a Generalist Model for Embodied Navigation [24.816490551945435]
We propose the first generalist model for embodied navigation, NaviLLM.
It adapts LLMs to embodied navigation by introducing schema-based instruction.
We conduct extensive experiments to evaluate the performance and generalizability of our model.
arXiv Detail & Related papers (2023-12-04T16:32:51Z)
- Multi-goal Audio-visual Navigation using Sound Direction Map [10.152838128195468]
We propose a new framework for multi-goal audio-visual navigation.
The research shows that multi-goal audio-visual navigation is difficult because it implicitly requires the agent to separate multiple sound sources.
We propose a method named sound direction map (SDM), which dynamically localizes multiple sound sources in a learning-based manner.
arXiv Detail & Related papers (2023-08-01T01:26:55Z)
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192]
We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event by navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
arXiv Detail & Related papers (2022-10-14T16:35:06Z)
- Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z)
- Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies show a slowdown in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)
- MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation [23.877609358505268]
Recent work shows that map-like memory is useful for long-horizon navigation tasks.
We propose the multiON task, which requires navigation to an episode-specific sequence of objects in a realistic environment.
We examine how a variety of agent models perform across a spectrum of navigation task complexities.
arXiv Detail & Related papers (2020-12-07T18:42:38Z)
- Active Visual Information Gathering for Vision-Language Navigation [115.40768457718325]
Vision-language navigation (VLN) is the task of requiring an agent to carry out navigational instructions inside photo-realistic environments.
One of the key challenges in VLN is how to conduct a robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment.
This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent VLN policy.
arXiv Detail & Related papers (2020-07-15T23:54:20Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines on both SR (success rate) and SPL metrics; a minimal SPL sketch follows this list.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
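For reference, the SR and SPL metrics mentioned in the last entry above are the standard embodied-navigation measures of success rate and Success weighted by Path Length (Anderson et al., 2018). A minimal sketch of the SPL computation, with hypothetical variable names:

```python
# Minimal sketch of SPL (Success weighted by Path Length), the metric referenced above.
def spl(successes, shortest_lengths, actual_lengths):
    """successes: 0/1 per episode; shortest_lengths: geodesic distance to the goal;
    actual_lengths: distance actually traversed by the agent."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * l / max(p, l)  # longer-than-optimal paths are penalized
    return total / len(successes)

# Example: two episodes, one success along a near-optimal path, one failure.
print(spl([1, 0], [5.0, 4.0], [6.0, 9.0]))  # about 0.42
```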