AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments
- URL: http://arxiv.org/abs/2210.07940v1
- Date: Fri, 14 Oct 2022 16:35:06 GMT
- Title: AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments
- Authors: Sudipta Paul and Amit K. Roy-Chowdhury and Anoop Cherian
- Abstract summary: We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event by navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
- Score: 60.98664330268192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have seen embodied visual navigation advance in two distinct
directions: (i) in equipping the AI agent to follow natural language
instructions, and (ii) in making the navigable world multimodal, e.g.,
audio-visual navigation. However, the real world is not only multimodal, but
also often complex, and thus in spite of these advances, agents still need to
understand the uncertainty in their actions and seek instructions to navigate.
To this end, we present AVLEN -- an interactive agent for
Audio-Visual-Language Embodied Navigation. Similar to audio-visual navigation
tasks, the goal of our embodied agent is to localize an audio event by
navigating the 3D visual world; however, the agent may also seek help from a
human (oracle), where the assistance is provided in free-form natural language.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement
learning backbone that learns: (a) high-level policies to choose either
audio-cues for navigation or to query the oracle, and (b) lower-level policies
to select navigation actions based on its audio-visual and language inputs. The
policies are trained by rewarding success on the navigation task while
minimizing the number of queries to the oracle. To empirically evaluate AVLEN,
we present experiments on the SoundSpaces framework for semantic audio-visual
navigation tasks. Our results show that equipping the agent to ask for help
leads to a clear improvement in performance, especially in challenging cases,
e.g., when the sound is unheard during training or in the presence of
distractor sounds.
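As a rough illustration of this hierarchical scheme, the sketch below pairs a high-level policy that picks between following audio cues and querying the oracle with a low-level policy that emits navigation actions, and a reward that pays for success while charging for queries. Module names, dimensions, and reward values here are hypothetical stand-ins, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    """Chooses between following audio cues and querying the oracle."""
    def __init__(self, state_dim=512, n_options=2):
        super().__init__()
        self.head = nn.Linear(state_dim, n_options)

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.head(state))

class LowLevelPolicy(nn.Module):
    """Maps the fused audio-visual(-language) state to navigation actions."""
    def __init__(self, state_dim=512, n_actions=4):   # forward/left/right/stop
        super().__init__()
        self.head = nn.Linear(state_dim, n_actions)

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.head(state))

def step_reward(reached_goal, queried,
                success_reward=10.0, query_cost=1.0, step_cost=0.01):
    """Reward navigation success while charging for each oracle query
    (illustrative values, not the paper's)."""
    r = -step_cost - (query_cost if queried else 0.0)
    return r + (success_reward if reached_goal else 0.0)

high, low = HighLevelPolicy(), LowLevelPolicy()
state = torch.randn(1, 512)        # fused multimodal state embedding
option = high(state).sample()      # 0: follow audio cues, 1: query the oracle
action = low(state).sample()       # low-level navigation action
print(option.item(), action.item(), step_reward(False, option.item() == 1))
```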
Related papers
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
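A minimal sketch of this view-to-text step, where `caption` and `detect` stand in for the off-the-shelf captioning and detection models; the helper names and output format are illustrative, not LangNav's actual interface.

```python
from typing import Any, Callable, List

def views_to_text(views: List[Any],
                  caption: Callable[[Any], str],
                  detect: Callable[[Any], List[str]]) -> str:
    """Describe an egocentric panoramic view (one image per heading)
    in natural language using off-the-shelf perception models."""
    headings = ["ahead", "to the right", "behind", "to the left"]
    parts = []
    for heading, img in zip(headings, views):
        objects = ", ".join(detect(img))   # e.g. "doorway, picture frame"
        parts.append(f"{caption(img).capitalize()} ({objects}) {heading}.")
    return " ".join(parts)

# Toy stand-ins for the captioner and detector:
print(views_to_text(
    views=[None] * 4,
    caption=lambda img: "a hallway with a wooden floor",
    detect=lambda img: ["doorway", "picture frame"],
))
```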
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Multi-goal Audio-visual Navigation using Sound Direction Map [10.152838128195468]
We propose a new framework for multi-goal audio-visual navigation.
The research shows that multi-goal audio-visual navigation is difficult because the agent implicitly needs to separate the sources of sound.
We propose a method named sound direction map (SDM), which dynamically localizes multiple sound sources in a learning-based manner.
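The summary gives few details, but the core idea of a learned direction map can be sketched as a network that predicts a distribution over discrete headings from audio features and fuses it over time; everything below (feature shapes, bin count, the smoothing update) is an assumption, not the paper's design.

```python
import torch
import torch.nn as nn

class DirectionHead(nn.Module):
    """Predicts a distribution over discrete headings from audio features."""
    def __init__(self, feat_dim=128, n_dirs=36):   # 10-degree bins (assumed)
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_dirs))

    def forward(self, audio_feat):                 # (B, feat_dim)
        return self.net(audio_feat).softmax(dim=-1)

head = DirectionHead()
direction_map = torch.zeros(36)
with torch.no_grad():
    for t in range(5):                             # fuse predictions over time
        p = head(torch.randn(1, 128))[0]           # per-step direction estimate
        direction_map = 0.9 * direction_map + 0.1 * p  # exponential smoothing
print(direction_map.argmax().item())               # strongest heading bin
```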
arXiv Detail & Related papers (2023-08-01T01:26:55Z)
- CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments [41.21509045214965]
CAVEN is a framework in which the agent may interact with a human/oracle to solve the task of navigating to an audio goal.
Our results show that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate.
arXiv Detail & Related papers (2023-06-06T22:32:49Z)
- Towards Versatile Embodied Navigation [120.73460380993305]
VIENNA is a versatile embodied navigation agent that simultaneously learns to perform four navigation tasks with one model.
We empirically demonstrate that, compared with learning each visual navigation task individually, our agent achieves comparable or even better performance with reduced complexity.
arXiv Detail & Related papers (2022-10-30T11:53:49Z)
- Towards Generalisable Audio Representations for Audio-Visual Navigation [18.738943602529805]
In audio-visual navigation (AVN), an intelligent agent needs to navigate to a constantly sound-making object in complex 3D environments.
We propose a contrastive learning-based method to tackle this challenge by regularising the audio encoder.
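One common way to regularise an encoder contrastively is an InfoNCE loss that pulls embeddings of two views of the same sound together; the sketch below shows that generic recipe, and the pairing strategy and temperature are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """anchor, positive: (B, D) embeddings of two views of the same sound."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature    # (B, B) cosine-similarity matrix
    labels = torch.arange(a.size(0))    # matching pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

# Example: 8 pairs of 128-d audio embeddings from the encoder.
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```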
arXiv Detail & Related papers (2022-06-01T11:00:07Z)
- Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies report a slowdown in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
- Active Visual Information Gathering for Vision-Language Navigation [115.40768457718325]
Vision-language navigation (VLN) is the task in which an agent carries out navigational instructions inside photo-realistic environments.
One of the key challenges in VLN is how to conduct a robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment.
This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent VLN policy.
arXiv Detail & Related papers (2020-07-15T23:54:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.