A Deep Reinforcement Learning Approach to Audio-Based Navigation in a
Multi-Speaker Environment
- URL: http://arxiv.org/abs/2105.04488v1
- Date: Mon, 10 May 2021 16:26:47 GMT
- Title: A Deep Reinforcement Learning Approach to Audio-Based Navigation in a
Multi-Speaker Environment
- Authors: Petros Giannakopoulos, Aggelos Pikrakis, Yannis Cotronis
- Abstract summary: We create an autonomous agent that can navigate in a two-dimensional space using only raw auditory sensory information from the environment.
Our experiments show that the agent can successfully identify a particular target speaker among a set of $N$ predefined speakers in a room.
The agent is shown to be robust to speaker pitch shifting, and it can learn to navigate the environment even when a limited number of training utterances are available for each speaker.
- Score: 1.0527821704930371
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work we use deep reinforcement learning to create an autonomous agent
that can navigate in a two-dimensional space using only raw auditory sensory
information from the environment, a problem that has received very little
attention in the reinforcement learning literature. Our experiments show that
the agent can successfully identify a particular target speaker among a set of
$N$ predefined speakers in a room and move itself towards that speaker, while
avoiding collision with other speakers or going outside the room boundaries.
The agent is shown to be robust to speaker pitch shifting, and it can learn to
navigate the environment even when a limited number of training utterances are
available for each speaker.
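As a rough illustration of the task described above, the setup can be sketched as a toy 2D environment: an agent in a square room hears a mixture from $N$ speaker sources and must reach the target speaker while avoiding the others and the walls. This is not the authors' implementation; the class name, the distance-attenuated stand-in for raw audio, and the reward shaping are all assumptions made for the sketch.

```python
import numpy as np

class AudioNavEnv:
    """Toy 2D audio-navigation environment (illustrative only)."""

    def __init__(self, n_speakers=3, room_size=10.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n = n_speakers
        self.size = room_size
        self.reset()

    def reset(self):
        # Random speaker positions and a random target speaker index;
        # the agent starts at the center of the room.
        self.speakers = self.rng.uniform(0, self.size, size=(self.n, 2))
        self.target = int(self.rng.integers(self.n))
        self.agent = np.array([self.size / 2, self.size / 2])
        return self._observe()

    def _observe(self):
        # Stand-in for raw audio: each speaker emits a unit-amplitude
        # signal attenuated by distance; the agent observes the per-source
        # levels plus a one-hot cue for which voice it should follow.
        dists = np.linalg.norm(self.speakers - self.agent, axis=1)
        levels = 1.0 / (1.0 + dists)
        cue = np.eye(self.n)[self.target]
        return np.concatenate([levels, cue])

    def step(self, action):
        # action: a 2D displacement; the position is clipped to the room,
        # which models the boundary constraint from the abstract.
        self.agent = np.clip(self.agent + np.asarray(action), 0, self.size)
        dists = np.linalg.norm(self.speakers - self.agent, axis=1)
        reached = bool(dists[self.target] < 0.5)
        collided = any(d < 0.5 for i, d in enumerate(dists) if i != self.target)
        # +1 for reaching the target, -1 for colliding with another
        # speaker, small step penalty otherwise (assumed shaping).
        reward = 1.0 if reached else (-1.0 if collided else -0.01)
        done = reached or collided
        return self._observe(), reward, done


# Usage: one step toward the target (using the true positions here only
# for illustration; the paper's agent sees audio, not coordinates).
env = AudioNavEnv(n_speakers=3)
obs = env.reset()
direction = env.speakers[env.target] - env.agent
obs, reward, done = env.step(direction / np.linalg.norm(direction))
```

In the paper the observation would be actual waveform or spectral input and the policy would be trained with deep RL; this sketch only fixes the interface (observation, action, reward, termination) that such a task implies.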
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- HiddenSpeaker: Generate Imperceptible Unlearnable Audios for Speaker Verification System [0.9591674293850556]
We propose a framework named HiddenSpeaker, embedding imperceptible perturbations within the training speech samples.
Our results demonstrate that HiddenSpeaker not only deceives the model with unlearnable samples but also enhances the imperceptibility of the perturbations.
arXiv Detail & Related papers (2024-05-24T15:49:00Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Know your audience: specializing grounded language models with listener subtraction [20.857795779760917]
We take inspiration from Dixit to formulate a multi-agent image reference game.
We show that finetuning an attention-based adapter between a CLIP vision encoder and a large language model in this contrastive, multi-agent setting gives rise to context-dependent natural language specialization.
arXiv Detail & Related papers (2022-06-16T17:52:08Z)
- Self supervised learning for robust voice cloning [3.7989740031754806]
We use features learned in a self-supervised framework to produce high quality speech representations.
The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture.
This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice.
arXiv Detail & Related papers (2022-04-07T13:05:24Z)
- A Deep Reinforcement Learning Approach for Audio-based Navigation and Audio Source Localization in Multi-speaker Environments [1.0527821704930371]
In this work we apply deep reinforcement learning to the problems of navigating a three-dimensional environment and inferring the locations of human speaker audio sources within.
We create two virtual environments using the Unity game engine, one presenting an audio-based navigation problem and one presenting an audio source localization problem.
We also create an autonomous agent based on the PPO online reinforcement learning algorithm and train it to solve these environments.
arXiv Detail & Related papers (2021-10-25T10:18:34Z)
- A Real-time Speaker Diarization System Based on Spatial Spectrum [14.189768987932364]
We propose a novel systematic approach to tackle several long-standing challenges in speaker diarization tasks.
First, a differential directional microphone array-based approach is exploited to capture the target speakers' voices in far-field adverse environments.
Second, an online speaker-location joint clustering approach is proposed to keep track of speaker location.
Third, an instant speaker number detector is developed to trigger the mechanism that separates overlapped speech.
arXiv Detail & Related papers (2021-07-20T08:25:23Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is the task of labeling audio or video recordings with classes corresponding to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.