Combination of Deep Speaker Embeddings for Diarisation
- URL: http://arxiv.org/abs/2010.12025v3
- Date: Fri, 7 May 2021 08:59:17 GMT
- Title: Combination of Deep Speaker Embeddings for Diarisation
- Authors: Guangzhi Sun and Chao Zhang and Phil Woodland
- Abstract summary: This paper proposes a c-vector method by combining multiple sets of complementary d-vectors derived from systems with different NN components.
A neural-based single-pass speaker diarisation pipeline is also proposed in this paper.
Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets.
- Score: 9.053645441056256
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Significant progress has recently been made in speaker diarisation after the
introduction of d-vectors as speaker embeddings extracted from neural network
(NN) speaker classifiers for clustering speech segments. To extract
better-performing and more robust speaker embeddings, this paper proposes a
c-vector method by combining multiple sets of complementary d-vectors derived
from systems with different NN components. Three structures are used to
implement the c-vectors, namely 2D self-attentive, gated additive, and bilinear
pooling structures, relying on attention mechanisms, a gating mechanism, and a
low-rank bilinear pooling mechanism respectively. Furthermore, a neural-based
single-pass speaker diarisation pipeline is also proposed in this paper, which
uses NNs to achieve voice activity detection, speaker change point detection,
and speaker embedding extraction. Experiments and detailed analyses are
conducted on the challenging AMI and NIST RT05 datasets which consist of real
meetings with 4--10 speakers and a wide range of acoustic conditions. For
systems trained on the AMI training set, relative speaker error rate (SER)
reductions of 13% and 29% are obtained by using c-vectors instead of d-vectors
on the AMI dev and eval sets respectively, and a relative reduction of 15% in
SER is observed on RT05, which shows the robustness of the proposed methods. By
incorporating VoxCeleb data into the training set, the best c-vector system
achieved 7%, 17% and 16% relative SER reductions compared to the d-vector on the
AMI dev, eval, and RT05 sets respectively.
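The abstract names a gated additive structure as one of the three ways to combine complementary d-vectors into a c-vector. The following is a minimal sketch of that idea: an element-wise sigmoid gate decides, per dimension, how much each d-vector contributes. The gate parameters `W` and `b` here are hypothetical random placeholders (in the real system they would be learned jointly with the rest of the network), and this is not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim = 128

# Two complementary d-vectors, e.g. from systems with different NN components
d1 = rng.standard_normal(dim)
d2 = rng.standard_normal(dim)

# Hypothetical gate parameters; learned in the actual system
W = rng.standard_normal((dim, 2 * dim)) * 0.01
b = np.zeros(dim)

# Element-wise gate computed from both inputs
g = sigmoid(W @ np.concatenate([d1, d2]) + b)

# Gated additive combination: a per-dimension convex mix of the two d-vectors
c = g * d1 + (1.0 - g) * d2
```

Because the gate lies in (0, 1), each dimension of `c` is a convex combination of the corresponding dimensions of `d1` and `d2`, so the combined embedding interpolates between the two systems rather than simply averaging them.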
Related papers
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Analyzing And Improving Neural Speaker Embeddings for ASR [54.30093015525726]
We present our work on integrating neural speaker embeddings into a Conformer-based hybrid HMM ASR system.
Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
arXiv Detail & Related papers (2023-01-11T16:56:03Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose a joint source separation and dereverberation method based on the independent vector analysis (IVA) paradigm.
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- Generation of Speaker Representations Using Heterogeneous Training Batch Assembly [16.534380339042087]
We propose a new CNN-based speaker modeling scheme.
We randomly and synthetically augment the training data into a set of segments.
A soft label is imposed on each segment based on its speaker occupation ratio.
arXiv Detail & Related papers (2022-03-30T19:59:05Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
- Neural i-vectors [21.13825969777844]
We investigate the use of a deep embedding extractor and an i-vector extractor in succession.
To bundle the deep embedding extractor with an i-vector extractor, we add aggregation layers inspired by the Gaussian mixture model (GMM) to the embedding extractor networks.
We compare the deep embeddings to the proposed neural i-vectors on the Speakers in the Wild (SITW) and the Speaker Recognition Evaluation (SRE) 2018 and 2019 datasets.
arXiv Detail & Related papers (2020-04-03T13:29:31Z)
- Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
arXiv Detail & Related papers (2020-02-14T18:31:31Z)
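The last entry above describes reading speaker i-vectors from a memory block through an attention mechanism. A minimal sketch of such an attention read follows; the memory contents, dimensions, and scaled dot-product scoring are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
n_speakers, dim = 50, 100

# Memory block of speaker i-vectors collected from training data (random stand-ins)
memory = rng.standard_normal((n_speakers, dim))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read_memory(query):
    """Attention read: weight each stored i-vector by its similarity to the query."""
    scores = memory @ query / np.sqrt(dim)  # scaled dot-product similarity
    weights = softmax(scores)               # attention weights sum to 1
    return weights @ memory                 # weighted sum over stored i-vectors

# The query would come from the current utterance's acoustic representation
q = rng.standard_normal(dim)
m_vector = read_memory(q)
```

The returned vector is a convex combination of stored i-vectors, which is why no auxiliary embedding extractor is needed at test time: the adaptation vector is assembled from training-set speakers on the fly.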