Speaker Diarization as a Fully Online Learning Problem in MiniVox
- URL: http://arxiv.org/abs/2006.04376v3
- Date: Thu, 22 Oct 2020 02:56:34 GMT
- Title: Speaker Diarization as a Fully Online Learning Problem in MiniVox
- Authors: Baihan Lin, Xinxin Zhang
- Abstract summary: We proposed a novel machine learning framework to conduct real-time multi-speaker diarization and recognition without prior registration and pretraining.
We built upon existing datasets of real world utterances to automatically curate MiniVox.
We provided a workable web-based recognition system which interactively handles the cold start problem of new user's addition.
- Score: 18.181920080789475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We proposed a novel machine learning framework to conduct real-time
multi-speaker diarization and recognition without prior registration and
pretraining in a fully online learning setting. Our contributions are two-fold.
First, we proposed a new benchmark to evaluate the rarely studied fully online
speaker diarization problem. We built upon existing datasets of real world
utterances to automatically curate MiniVox, an experimental environment which
generates infinite configurations of continuous multi-speaker speech stream.
Second, we considered the practical problem of online learning with
episodically revealed rewards and introduced a solution based on
semi-supervised and self-supervised learning methods. Additionally, we provided
a workable web-based recognition system which interactively handles the cold
start problem of new user's addition by transferring representations of old
arms to new ones with an extendable contextual bandit. We demonstrated that our
proposed method obtained robust performance in the online MiniVox framework.
Related papers
- An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems [18.829793635104608]
We introduce a general framework for ASR in dialog systems.
We show that leveraging our new framework compared to traditional training leads to relative WER reductions of close to 10% in real-world dialog systems.
arXiv Detail & Related papers (2024-09-16T17:59:50Z)
- Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization [3.836171323110284]
We show that a simple audio convolutional recurrent neural network can perform simultaneous horizontal active speaker detection and localization.
We propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach.
arXiv Detail & Related papers (2023-12-21T16:53:04Z)
- DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning [140.96990096377127]
We introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR).
DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network.
We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
arXiv Detail & Related papers (2023-05-17T07:23:46Z)
- A Reinforcement Learning Framework for Online Speaker Diarization [18.181920080789475]
Speaker diarization is a task to label an audio or video recording with the identity of the speaker at each given time stamp.
We propose a novel machine learning framework to conduct real-time multi-speaker diarization and recognition without prior registration and pretraining.
arXiv Detail & Related papers (2023-02-21T15:42:25Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Improved Relation Networks for End-to-End Speaker Verification and Identification [0.0]
Speaker identification systems are tasked to identify a speaker amongst a set of enrolled speakers given just a few samples.
We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification.
Inspired by the use of prototypical networks in speaker verification, we train the model to classify samples in the current episode amongst all speakers present in the training set.
arXiv Detail & Related papers (2022-03-31T17:44:04Z)
- A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is a task to label audio or video recordings with classes corresponding to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems [75.43457658815943]
We propose Minimalist Transfer Learning (MinTL) to simplify the system design process of task-oriented dialogue systems.
MinTL is a simple yet effective transfer learning framework, which allows us to plug-and-play pre-trained seq2seq models.
We instantiate our learning framework with two pre-trained backbones: T5 and BART, and evaluate them on MultiWOZ.
arXiv Detail & Related papers (2020-09-25T02:19:13Z)
- Wandering Within a World: Online Contextualized Few-Shot Learning [62.28521610606054]
We aim to bridge the gap between typical human and machine-learning environments by extending the standard framework of few-shot learning to an online setting.
We propose a new prototypical few-shot learning based on large scale indoor imagery that mimics the visual experience of an agent wandering within a world.
arXiv Detail & Related papers (2020-07-09T04:05:04Z)