Diarisation using location tracking with agglomerative clustering
- URL: http://arxiv.org/abs/2109.10598v2
- Date: Fri, 24 Sep 2021 01:44:01 GMT
- Title: Diarisation using location tracking with agglomerative clustering
- Authors: Jeremy H. M. Wong, Igor Abramovski, Xiong Xiao, and Yifan Gong
- Abstract summary: This paper explicitly models the movements of speakers within an Agglomerative Hierarchical Clustering (AHC) diarisation framework.
Experiments show that the proposed approach is able to yield improvements on a Microsoft rich meeting transcription task.
- Score: 42.13772744221499
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Previous works have shown that spatial location information can be
complementary to speaker embeddings for a speaker diarisation task. However,
the models used often assume that speakers are fairly stationary throughout a
meeting. This paper proposes to relax this assumption, by explicitly modelling
the movements of speakers within an Agglomerative Hierarchical Clustering (AHC)
diarisation framework. Kalman filters, which track the locations of speakers,
are used to compute log-likelihood ratios that contribute to the cluster
affinity computations for the AHC merging and stopping decisions. Experiments
show that the proposed approach is able to yield improvements on a Microsoft
rich meeting transcription task, compared to methods that do not use location
information or that make stationarity assumptions.
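The affinity term described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 1-D location (for example a direction-of-arrival angle), a random-walk motion model, and hypothetical noise variances `q` and `r`. The function names `kalman_log_likelihood` and `location_llr` are invented for illustration. The log-likelihood ratio compares one Kalman filter tracking the merged cluster against two independent filters, one per cluster.

```python
import math

def kalman_log_likelihood(times, obs, q=0.05, r=0.01,
                          prior_mean=0.0, prior_var=10.0):
    """Log-likelihood of 1-D location observations under a random-walk
    Kalman filter. q (process noise variance per second), r (observation
    noise variance) and the prior are illustrative assumptions."""
    mean, var = prior_mean, prior_var
    log_lik = 0.0
    prev_t = None
    for t, z in zip(times, obs):
        if prev_t is not None:
            var += q * (t - prev_t)      # predict: uncertainty grows with elapsed time
        prev_t = t
        s = var + r                      # innovation variance
        innov = z - mean                 # innovation (prediction error)
        log_lik += -0.5 * (math.log(2.0 * math.pi * s) + innov * innov / s)
        gain = var / s                   # Kalman gain
        mean += gain * innov             # update state estimate
        var *= (1.0 - gain)              # update state variance
    return log_lik

def location_llr(cluster_a, cluster_b, **kw):
    """Log-likelihood ratio for merging two clusters of (time, location) pairs:
    one shared track versus two independent tracks."""
    def ll(cluster):
        ordered = sorted(cluster)        # sort by time
        return kalman_log_likelihood([t for t, _ in ordered],
                                     [z for _, z in ordered], **kw)
    return ll(cluster_a + cluster_b) - ll(cluster_a) - ll(cluster_b)

# Toy data: cluster a, a nearby continuation, and a distant cluster.
a = [(0.0, 0.10), (1.0, 0.12), (2.0, 0.11)]
b_near = [(3.0, 0.13), (4.0, 0.12)]
b_far = [(3.0, 2.00), (4.0, 2.05)]

llr_near = location_llr(a, b_near)
llr_far = location_llr(a, b_far)
print(llr_near > llr_far)
```

In an AHC loop, a score like this would be combined with a speaker-embedding score to decide which clusters to merge and when to stop. How the two scores are weighted is not stated in the abstract, so it is left out here.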
Related papers
- Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens [45.161909551392085]
We propose a novel attention-based encoder-decoder method augmented with speaker class tokens obtained by speaker clustering.
During inference, we select multiple recognition hypotheses conditioned on predicted speaker cluster tokens.
These hypotheses are merged by agglomerative hierarchical clustering based on the normalized edit distance.
arXiv Detail & Related papers (2024-09-24T04:31:46Z)
- Tight integration of neural- and clustering-based diarization through deep unfolding of infinite Gaussian mixture model [84.57667267657382]
This paper introduces a trainable clustering algorithm into the integration framework.
Speaker embeddings are optimized during training such that they better fit iGMM clustering.
Experimental results show that the proposed approach outperforms the conventional approach in terms of diarization error rate.
arXiv Detail & Related papers (2022-02-14T07:45:21Z)
- Joint speaker diarisation and tracking in switching state-space model [51.58295550366401]
This paper proposes to explicitly track the movements of speakers while jointly performing diarisation within a unified model.
A state-space model is proposed, where the hidden state expresses the identity of the current active speaker and the predicted locations of all speakers.
Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach is able to perform comparably with other methods that use location information.
arXiv Detail & Related papers (2021-09-23T04:43:58Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Speaker diarization with session-level speaker embedding refinement using graph neural networks [26.688724154619504]
We present the first use of graph neural networks (GNNs) for the speaker diarization problem, utilizing a GNN to refine speaker embeddings locally.
The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space, in which the different speakers within a single session are better separated.
We show that the clustering performance of the refined speaker embeddings outperforms the original embeddings significantly on both simulated and real meeting data.
arXiv Detail & Related papers (2020-05-22T19:52:51Z)
- Probabilistic embeddings for speaker diarization [13.276960253126656]
Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization.
We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, also a diagonal precision matrix.
These precisions quantify the uncertainty about what the values of the embeddings might have been if they had been extracted from high quality speech segments.
arXiv Detail & Related papers (2020-04-06T14:51:01Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)