Omnidirectional Information Gathering for Knowledge Transfer-based
Audio-Visual Navigation
- URL: http://arxiv.org/abs/2308.10306v1
- Date: Sun, 20 Aug 2023 16:03:54 GMT
- Title: Omnidirectional Information Gathering for Knowledge Transfer-based
Audio-Visual Navigation
- Authors: Jinyu Chen, Wenguan Wang, Si Liu, Hongsheng Li, Yi Yang
- Abstract summary: ORAN is an omnidirectional audio-visual navigator based on cross-task navigation skill transfer.
ORAN sharpens its two basic abilities for such a challenging task, namely wayfinding and audio-visual information gathering.
- Score: 95.2546147495844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual navigation is an audio-targeted wayfinding task where a robot
agent is required to travel through a never-before-seen 3D environment toward the
sounding source. In this article, we present ORAN, an omnidirectional
audio-visual navigator based on cross-task navigation skill transfer. In
particular, ORAN sharpens its two basic abilities for such a challenging task,
namely wayfinding and audio-visual information gathering. First, ORAN is
trained with a confidence-aware cross-task policy distillation (CCPD) strategy.
CCPD transfers the fundamental, point-to-point wayfinding skill, well trained
on the large-scale PointGoal task, to ORAN, helping it master audio-visual
navigation with far fewer training samples. To improve the efficiency of
knowledge transfer and address the domain gap, CCPD adapts to the decision
confidence of the teacher policy. Second, ORAN is
equipped with an omnidirectional information gathering (OIG) mechanism, i.e.,
gleaning visual-acoustic observations from different directions before
decision-making. As a result, ORAN yields more robust navigation behaviour.
Taking CCPD and OIG together, ORAN significantly outperforms previous
competitors. With model ensembling, ORAN won 1st place in the SoundSpaces
Challenge 2022, improving SPL and SR by a relative 53% and 35%.
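The abstract does not spell out the CCPD objective. As a rough illustration only, a confidence-weighted distillation loss might look like the sketch below, where the confidence proxy (the teacher's maximum action probability) and all function names are assumptions, not the paper's actual formulation:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def ccpd_loss(teacher_probs, student_probs):
    """Confidence-weighted policy distillation over a batch of steps.

    Each step's KL term is scaled by a confidence proxy (here, the
    teacher's maximum action probability -- an assumption, not the
    paper's exact formulation), so steps where the PointGoal teacher
    is unsure contribute less to the transfer signal.
    """
    losses = []
    for p_t, p_s in zip(teacher_probs, student_probs):
        confidence = max(p_t)  # hypothetical confidence proxy
        losses.append(confidence * kl_divergence(p_t, p_s))
    return sum(losses) / len(losses)
```

Under this sketch, a step where the teacher is nearly uniform over actions is down-weighted relative to a step where it strongly prefers one action, which matches the abstract's stated goal of making transfer adaptive to teacher confidence.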
Related papers
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning
Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), which performs parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- Multi-goal Audio-visual Navigation using Sound Direction Map [10.152838128195468]
We propose a new framework for multi-goal audio-visual navigation.
The research shows that multi-goal audio-visual navigation is difficult because it implicitly requires separating multiple sound sources.
We propose a method named sound direction map (SDM), which dynamically localizes multiple sound sources in a learning-based manner.
arXiv Detail & Related papers (2023-08-01T01:26:55Z)
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192]
We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event by navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
arXiv Detail & Related papers (2022-10-14T16:35:06Z)
- Towards Generalisable Audio Representations for Audio-Visual Navigation [18.738943602529805]
In audio-visual navigation (AVN), an intelligent agent needs to navigate to a constantly sound-making object in complex 3D environments.
We propose a contrastive learning-based method to tackle this challenge by regularising the audio encoder.
arXiv Detail & Related papers (2022-06-01T11:00:07Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
- Learning Object Relation Graph and Tentative Policy for Visual Navigation [44.247995617796484]
It is critical to learn an informative visual representation and a robust navigation policy.
This paper proposes three complementary techniques: object relation graph (ORG), trial-driven imitation learning (IL), and a memory-augmented tentative policy network (TPN).
We report 22.8% and 23.5% increases in success rate and Success weighted by Path Length (SPL), respectively.
arXiv Detail & Related papers (2020-07-21T18:03:05Z)
- Active Visual Information Gathering for Vision-Language Navigation [115.40768457718325]
Vision-language navigation (VLN) tasks an agent with carrying out navigational instructions inside photo-realistic environments.
One of the key challenges in VLN is how to conduct a robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment.
This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent VLN policy.
arXiv Detail & Related papers (2020-07-15T23:54:20Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines in both SR and SPL metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
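Several entries above report SPL (Success weighted by Path Length) alongside success rate (SR). SPL follows the standard definition of Anderson et al. (2018); a minimal reference implementation, assuming per-episode success flags and path lengths are already available:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length (Anderson et al., 2018).

    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is the
    binary success indicator, l_i the shortest-path length to the
    goal, and p_i the length of the path the agent actually took.
    An agent scores 1.0 on an episode only if it succeeds via the
    shortest possible path; longer successful paths score less.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```

For example, an agent that succeeds along the shortest path in one episode but fails a second episode scores 0.5 overall, which is why SPL is reported together with raw success rate.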
This list is automatically generated from the titles and abstracts of the papers in this site.