A Deep Reinforcement Learning Approach for Audio-based Navigation and
Audio Source Localization in Multi-speaker Environments
- URL: http://arxiv.org/abs/2110.12778v1
- Date: Mon, 25 Oct 2021 10:18:34 GMT
- Title: A Deep Reinforcement Learning Approach for Audio-based Navigation and
Audio Source Localization in Multi-speaker Environments
- Authors: Petros Giannakopoulos, Aggelos Pikrakis, Yannis Cotronis
- Abstract summary: In this work we apply deep reinforcement learning to the problems of navigating a three-dimensional environment and inferring the locations of human speaker audio sources within it.
We create two virtual environments using the Unity game engine, one presenting an audio-based navigation problem and one presenting an audio source localization problem.
We also create an autonomous agent based on the PPO online reinforcement learning algorithm and attempt to train it to solve these environments.
- Score: 1.0527821704930371
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work we apply deep reinforcement learning to the problems of
navigating a three-dimensional environment and inferring the locations of human
speaker audio sources within it, in the case where the only available information
is the raw sound from the environment, as a simulated human listener placed in
the environment would hear it. For this purpose we create two virtual
environments using the Unity game engine, one presenting an audio-based
navigation problem and one presenting an audio source localization problem. We
also create an autonomous agent based on the PPO online reinforcement learning
algorithm and attempt to train it to solve these environments. Our experiments
show that our agent achieves adequate performance and generalization ability in
both environments, measured by quantitative metrics, even when a limited amount
of training data is available or the environment parameters shift in ways not
encountered during training. We also show that a degree of agent knowledge
transfer is possible between the environments.
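To make the setup concrete, the sketch below is a minimal stand-in for this kind of experiment: a toy Gymnasium environment whose only observation is a raw audio frame whose amplitude decays with the agent's distance from a hidden speaker, trained with an off-the-shelf PPO implementation. The class, frame size, movement model, and reward values are illustrative assumptions, not the paper's Unity environments or hyperparameters.

```python
# Minimal sketch (not the authors' code): navigation from raw audio alone.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

FRAME = 1024  # samples of raw audio per observation (assumed frame size)

class AudioNavEnv(gym.Env):
    """Toy 2-D audio navigation: the agent hears louder sound as it nears
    a hidden speaker and must walk to it using audio only."""

    def __init__(self, room_size=10.0, max_steps=200):
        super().__init__()
        self.room_size, self.max_steps = room_size, max_steps
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(FRAME,), dtype=np.float32)
        self.action_space = spaces.Discrete(4)  # up / down / left / right
        self._moves = np.array([[0, 1], [0, -1], [-1, 0], [1, 0]], dtype=np.float32)

    def _observe(self):
        # Speech stand-in: noise attenuated by inverse distance to the source.
        gain = 1.0 / (1.0 + np.linalg.norm(self.agent - self.source))
        return (gain * self.np_random.uniform(-1, 1, FRAME)).astype(np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.agent = self.np_random.uniform(0, self.room_size, 2).astype(np.float32)
        self.source = self.np_random.uniform(0, self.room_size, 2).astype(np.float32)
        self.steps = 0
        return self._observe(), {}

    def step(self, action):
        self.agent = np.clip(self.agent + 0.5 * self._moves[action], 0.0, self.room_size)
        self.steps += 1
        reached = float(np.linalg.norm(self.agent - self.source)) < 0.5
        truncated = self.steps >= self.max_steps
        reward = 10.0 if reached else -0.01  # sparse goal reward, small step cost
        return self._observe(), reward, reached, truncated, {}

# Training with Stable-Baselines3's PPO (any PPO implementation would do):
#   from stable_baselines3 import PPO
#   model = PPO("MlpPolicy", AudioNavEnv(), verbose=1)
#   model.learn(total_timesteps=200_000)
```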
Related papers
- Audio-Driven Reinforcement Learning for Head-Orientation in Naturalistic Environments [0.7373617024876725]
We propose an audio-driven DRL framework to develop an autonomous agent that orients towards a talker in the acoustic environment.
Our results show that the agent learned to perform the task at a near perfect level when trained on speech segments in anechoic environments.
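As a rough illustration of this task (not the paper's actual reward), a head-orientation policy can be scored by the wrapped angular error between the agent's facing direction and the talker's direction:

```python
# Hypothetical reward shaping for a head-orientation agent; the angles,
# tolerance, and scaling here are assumptions for illustration only.
import numpy as np

def orientation_reward(head_angle, talker_angle, tolerance=np.deg2rad(5.0)):
    """Full reward when the head points at the talker (within tolerance),
    otherwise a penalty proportional to the wrapped angular error."""
    err = np.angle(np.exp(1j * (talker_angle - head_angle)))  # wrap to (-pi, pi]
    return 1.0 if abs(err) < tolerance else -abs(err) / np.pi
```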
arXiv Detail & Related papers (2024-09-16T07:20:33Z)
- Audio Simulation for Sound Source Localization in Virtual Environment [0.0]
Non-line-of-sight localization in signal-deprived environments is a challenging yet pertinent problem.
In this study, we aim to localize sound sources within a virtual environment by leveraging physically grounded sound propagation simulations and machine learning methods.
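For context, a classical building block for this kind of localization is time-delay estimation between microphone pairs; the GCC-PHAT routine below is the textbook version of that step, shown only to ground the summary, and is not necessarily what this paper uses.

```python
# Standard GCC-PHAT time-difference-of-arrival estimator (textbook method,
# not taken from the paper above).
import numpy as np

def gcc_phat(x, y, sr, max_tau=None):
    """Return the estimated delay (seconds) between two microphone signals."""
    n = len(x) + len(y)                                # zero-pad against wrap-around
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)      # PHAT weighting
    max_shift = n // 2 if max_tau is None else min(int(sr * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / sr
```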
arXiv Detail & Related papers (2024-04-02T03:18:28Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
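The masking step that drives such training can be pictured with the toy routine below; the spectrogram layout, patch size, and mask ratio are assumptions, not the paper's configuration.

```python
# Illustrative patch masking for a (channels, freq, time) spectrogram; an
# encoder-decoder would be trained to reconstruct the zeroed regions.
import numpy as np

def mask_patches(spec, mask_ratio=0.75, patch=16, rng=None):
    rng = rng or np.random.default_rng()
    channels, freqs, frames = spec.shape
    keep = rng.random((freqs // patch, frames // patch)) > mask_ratio
    mask = np.kron(keep, np.ones((patch, patch)))      # expand patch grid to bins
    masked = spec.copy()
    masked[:, :mask.shape[0], :mask.shape[1]] *= mask
    return masked, mask
```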
arXiv Detail & Related papers (2023-07-10T17:58:17Z)
- CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments [41.21509045214965]
CAVEN is a framework in which the agent may interact with a human/oracle to solve the task of navigating to an audio goal.
Our results show that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate.
arXiv Detail & Related papers (2023-06-06T22:32:49Z)
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192]
We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event via navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
arXiv Detail & Related papers (2022-10-14T16:35:06Z)
- Few-Shot Audio-Visual Learning of Environment Acoustics [89.16560042178523]
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener.
We explore how to infer RIRs based on a sparse set of images and echoes observed in the space.
In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs.
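The role an RIR plays is easy to state in code: convolving a "dry" source signal with the impulse response yields the reverberant signal heard at the listener's position. A minimal sketch (function name assumed):

```python
# Rendering a dry signal through a room impulse response.
import numpy as np
from scipy.signal import fftconvolve

def apply_rir(dry, rir):
    """Return the reverberant signal a listener at the RIR's position hears."""
    return fftconvolve(dry, rir, mode="full")[: len(dry)]

# Usage: wet = apply_rir(speech, estimated_rir)
```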
arXiv Detail & Related papers (2022-06-08T16:38:24Z)
- Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds [5.002862602915434]
Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment.
We propose a novel dynamic audio-visual navigation benchmark which requires the agent to catch a moving sound source in an environment with noisy and distracting sounds.
We demonstrate that our approach consistently outperforms the current state-of-the-art by a large margin across all tasks of moving sounds, unheard sounds, and noisy environments.
arXiv Detail & Related papers (2021-11-29T15:17:46Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
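To convey what binaural generation must produce (this is not the paper's learned method), a crude signal-processing baseline applies an interaural time difference and level difference derived from the source azimuth:

```python
# Naive mono-to-binaural baseline using the Woodworth ITD model and a crude
# level difference; all constants are illustrative assumptions.
import numpy as np

def mono_to_binaural(mono, azimuth, sr=16000, head_radius=0.0875, c=343.0):
    """azimuth (radians): negative = source to the left, positive = right."""
    theta = min(abs(azimuth), np.pi / 2)
    itd = head_radius * (theta + np.sin(theta)) / c     # Woodworth ITD model
    delay = int(round(itd * sr))                        # far ear hears it later
    level = 10 ** (-6.0 * theta / (np.pi / 2) / 20)     # up to ~6 dB quieter
    far = level * np.pad(mono, (delay, 0))[: len(mono)]
    left, right = (mono, far) if azimuth < 0 else (far, mono)
    return np.stack([left, right])                      # shape (2, n_samples)
```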
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- A Deep Reinforcement Learning Approach to Audio-Based Navigation in a Multi-Speaker Environment [1.0527821704930371]
We create an autonomous agent that can navigate in a two-dimensional space using only raw auditory sensory information from the environment.
Our experiments show that the agent can successfully identify a particular target speaker among a set of $N$ predefined speakers in a room.
The agent is shown to be robust to speaker pitch shifting and it can learn to navigate the environment, even when a limited number of training utterances are available for each speaker.
arXiv Detail & Related papers (2021-05-10T16:26:47Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
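The core idea of dynamic stream weighting can be shown in a few lines (the paper's model and notation are not reproduced here): per-region posteriors from the audio and video streams are mixed with a weight that can vary over regions and time.

```python
# Illustrative fusion of audio and video speaker-position posteriors with
# per-region dynamic stream weights (names and shapes assumed).
import numpy as np

def fuse_posteriors(p_audio, p_video, weights):
    """p_audio, p_video: (regions,) posteriors; weights: (regions,) in [0, 1]."""
    fused = weights * p_audio + (1.0 - weights) * p_video
    return fused / fused.sum()  # renormalize to a proper distribution
```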
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.