A Reinforcement Learning Framework for Online Speaker Diarization
- URL: http://arxiv.org/abs/2302.10924v1
- Date: Tue, 21 Feb 2023 15:42:25 GMT
- Title: A Reinforcement Learning Framework for Online Speaker Diarization
- Authors: Baihan Lin, Xinxin Zhang
- Abstract summary: Speaker diarization is a task to label an audio or video recording with the identity of the speaker at each given time stamp.
We propose a novel machine learning framework to conduct real-time multi-speaker diarization and recognition without prior registration and pretraining.
- Score: 18.181920080789475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker diarization is a task to label an audio or video recording with the
identity of the speaker at each given time stamp. In this work, we propose a
novel machine learning framework to conduct real-time multi-speaker diarization
and recognition without prior registration and pretraining in a fully online
and reinforcement learning setting. Our framework combines embedding
extraction, clustering, and resegmentation into a single online
decision-making problem. We discuss practical considerations and advanced
techniques such as offline reinforcement learning, semi-supervision, and
domain adaptation to address the challenges of limited training data and
out-of-distribution environments. Our approach considers speaker diarization as
a fully online learning problem of the speaker recognition task, where the
agent receives no pretraining from any training set before deployment, and
learns to detect speaker identity on the fly through reward feedback. The
paradigm of the reinforcement learning approach to speaker diarization presents
an adaptive, lightweight, and generalizable system that is useful for
multi-user teleconferences, where many people might come and go without
extensive pre-registration ahead of time. Lastly, we provide a desktop
application that uses our proposed approach as a proof of concept. To the best
of our knowledge, this is the first work to apply reinforcement learning
to the speaker diarization task.
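
The online setting described in the abstract (no pretraining, speaker labels learned from reward feedback) can be illustrated with a toy contextual bandit. This is a minimal sketch under stated assumptions, not the authors' implementation: the class `OnlineSpeakerBandit`, the function `simulate`, the epsilon-greedy policy, the linear reward model, and the two-dimensional synthetic "embeddings" are all hypothetical choices made for illustration. New speakers are registered by hand through `add_speaker`; how the paper detects them is not reproduced here.

```python
import random


class OnlineSpeakerBandit:
    """Epsilon-greedy contextual bandit with one linear reward model per speaker."""

    def __init__(self, dim, epsilon=0.1, lr=0.1, seed=0):
        self.dim = dim
        self.epsilon = epsilon      # probability of exploring a random speaker
        self.lr = lr                # step size for the online update
        self.weights = []           # one weight vector per known speaker (arm)
        self.rng = random.Random(seed)

    def add_speaker(self):
        """Register a new arm with zero weights and return its index."""
        self.weights.append([0.0] * self.dim)
        return len(self.weights) - 1

    def _score(self, arm, x):
        # Predicted reward for labelling embedding x as speaker `arm`.
        return sum(w * xi for w, xi in zip(self.weights[arm], x))

    def act(self, x):
        """Pick a speaker label for embedding x."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.weights))
        return max(range(len(self.weights)), key=lambda a: self._score(a, x))

    def update(self, arm, x, reward):
        """One gradient step on the squared error of the predicted reward."""
        err = reward - self._score(arm, x)
        self.weights[arm] = [w + self.lr * err * xi
                             for w, xi in zip(self.weights[arm], x)]


def simulate(steps=500, seed=0):
    """Run the agent on two synthetic speakers.

    Returns the accuracy over the last 100 frames (assumes steps >= 100).
    """
    rng = random.Random(seed)
    centers = [[1.0, 0.0], [0.0, 1.0]]   # toy "embeddings", not real audio features
    agent = OnlineSpeakerBandit(dim=2, seed=seed + 1)
    for _ in centers:
        agent.add_speaker()              # assumption: speakers are registered by hand
    hits = []
    for _ in range(steps):
        true_speaker = rng.randrange(2)
        x = [c + rng.gauss(0.0, 0.05) for c in centers[true_speaker]]
        guess = agent.act(x)
        # Simulated feedback: 1 for a correct label, 0 otherwise.
        reward = 1.0 if guess == true_speaker else 0.0
        agent.update(guess, x, reward)   # only the chosen arm receives feedback
        hits.append(reward)
    return sum(hits[-100:]) / 100
```

Calling `simulate()` returns the fraction of correctly labelled frames among the last 100 steps. The agent starts with all-zero weights and sees feedback only for the label it chose, which mirrors the "no pretraining, learn on the fly" setting at a much smaller scale.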
Related papers
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Improved Relation Networks for End-to-End Speaker Verification and Identification [0.0]
Speaker identification systems are tasked to identify a speaker amongst a set of enrolled speakers given just a few samples.
We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification.
Inspired by the use of prototypical networks in speaker verification, we train the model to classify samples in the current episode amongst all speakers present in the training set.
arXiv Detail & Related papers (2022-03-31T17:44:04Z)
- Self-supervised Speaker Recognition Training Using Human-Machine Dialogues [22.262550043863445]
We investigate how to pretrain speaker recognition models by leveraging dialogues between customers and smart-speaker devices.
We propose an effective rejection mechanism that selectively learns from dialogues based on their acoustic homogeneity.
Experiments demonstrate that the proposed method provides significant performance improvements, superior to earlier work.
arXiv Detail & Related papers (2022-02-07T19:44:54Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is a task to label audio or video recordings with classes corresponding to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z)
- A Machine of Few Words -- Interactive Speaker Recognition with Reinforcement Learning [35.36769027019856]
We present a new paradigm for automatic speaker recognition that we call Interactive Speaker Recognition (ISR).
In this paradigm, the recognition system aims to incrementally build a representation of the speakers by requesting personalized utterances.
We show that our method achieves excellent performance while using only small amounts of speech signal.
arXiv Detail & Related papers (2020-08-07T12:44:08Z)
- Speaker Diarization as a Fully Online Learning Problem in MiniVox [18.181920080789475]
We proposed a novel machine learning framework to conduct real-time multi-speaker diarization and recognition without prior registration and pretraining.
We built upon existing datasets of real world utterances to automatically curate MiniVox.
We provided a workable web-based recognition system which interactively handles the cold-start problem when new users are added.
arXiv Detail & Related papers (2020-06-08T06:40:29Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)