Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization
- URL: http://arxiv.org/abs/2401.12850v1
- Date: Tue, 23 Jan 2024 15:35:44 GMT
- Title: Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization
- Authors: Prachi Singh, Sriram Ganapathy
- Abstract summary: We propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNN)
The proposed E-SHARC framework improves significantly over state-of-the-art diarization systems.
- Score: 41.24045486520547
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speaker diarization, the task of segmenting an audio recording based on
speaker identity, constitutes an important speech pre-processing step for
several downstream applications. The conventional approach to diarization
involves multiple steps of embedding extraction and clustering, which are often
optimized in an isolated fashion. While end-to-end diarization systems attempt
to learn a single model for the task, they are often cumbersome to train and
require large supervised datasets. In this paper, we propose an end-to-end
supervised hierarchical clustering algorithm based on graph neural networks
(GNN), called End-to-end Supervised HierARchical Clustering (E-SHARC). The
E-SHARC approach uses front-end mel-filterbank features as input and jointly
learns an embedding extractor and the GNN clustering module, performing
representation learning, metric learning, and clustering with end-to-end
optimization. Further, with additional inputs from an external overlap
detector, the E-SHARC approach is capable of predicting the speakers in the
overlapping speech regions. The experimental evaluation on several benchmark
datasets like AMI, VoxConverse and DISPLACE illustrates that the proposed
E-SHARC framework improves significantly over state-of-the-art diarization
systems.
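
To make the pipeline described in the abstract concrete, here is a minimal sketch of a jointly trainable extractor-plus-GNN stack in PyTorch. It is not the authors' implementation: the layer sizes, the mean pooling, the placeholder adjacency, and the binary edge labels are assumptions used only for illustration; the paper's actual architecture and training recipe differ in detail.

```python
# Minimal sketch (not the authors' code) in the spirit of E-SHARC:
# mel-filterbank segments -> embedding extractor -> GNN that scores edges
# between segments. Layer sizes and names are assumptions for illustration.
import torch
import torch.nn as nn

class EmbeddingExtractor(nn.Module):
    """Maps a mel-filterbank segment (T x n_mels) to a fixed-size embedding."""
    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, feats):                 # feats: (num_segments, T, n_mels)
        frame_emb = self.frame_net(feats)     # (num_segments, T, emb_dim)
        return frame_emb.mean(dim=1)          # temporal pooling -> (num_segments, emb_dim)

class GraphClusteringModule(nn.Module):
    """One round of GNN message passing plus an edge-existence classifier."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.gnn = nn.Linear(2 * emb_dim, emb_dim)
        self.edge_head = nn.Sequential(nn.Linear(2 * emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, emb, adj):
        # adj: (N, N) 0/1 adjacency over the segment graph.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh = adj @ emb / deg                               # mean of neighbour embeddings
        h = torch.relu(self.gnn(torch.cat([emb, neigh], dim=-1)))
        # Edge-existence logits for every ordered pair of segments.
        pair = torch.cat([h.unsqueeze(1).expand(-1, h.size(0), -1),
                          h.unsqueeze(0).expand(h.size(0), -1, -1)], dim=-1)
        return h, self.edge_head(pair).squeeze(-1)            # (N, emb_dim), (N, N)

# Joint forward/backward pass: both modules receive gradients from the
# clustering loss, which is the end-to-end aspect the abstract emphasises.
extractor, gnn = EmbeddingExtractor(), GraphClusteringModule()
feats = torch.randn(10, 150, 80)              # 10 segments, 150 frames, 80 mel bins
emb = extractor(feats)
adj = (emb @ emb.T > 0).float()               # placeholder graph; the paper builds it from segment similarities
refined, edge_logits = gnn(emb, adj)
loss = nn.functional.binary_cross_entropy_with_logits(
    edge_logits, torch.randint(0, 2, edge_logits.shape).float())  # same-speaker labels in practice
loss.backward()
```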
Related papers
- Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances [24.142013877384603]
This paper introduces a novel unsupervised multimodal clustering method (UMC), making a pioneering contribution to this field.
UMC introduces a unique approach to constructing augmentation views for multimodal data, which are then used to perform pre-training.
We show remarkable improvements of 2-6% in clustering metrics over state-of-the-art methods, marking the first successful endeavor in this domain.
arXiv Detail & Related papers (2024-05-21T13:24:07Z)
- Supervised Hierarchical Clustering using Graph Neural Networks for Speaker Diarization [41.30830281043803]
We propose a novel Supervised HierArchical gRaph Clustering algorithm (SHARC) for speaker diarization.
In this paper, we introduce a hierarchical structure using a Graph Neural Network (GNN) to perform supervised clustering.
The supervised clustering is performed using node densities and edge existence probabilities to merge the segments until convergence (a minimal sketch of such a merge loop follows this list).
arXiv Detail & Related papers (2023-02-24T16:16:41Z)
- Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering [18.62774420511154]
A multi-stage clustering strategy that uses different clustering algorithms for inputs of different lengths can address the multi-faceted challenges of speaker diarization applications.
This strategy is critical for streaming on-device speaker diarization systems, where the budgets of CPU, memory and battery are tight.
arXiv Detail & Related papers (2022-10-25T01:20:24Z)
- A Deep Dive into Deep Cluster [0.2578242050187029]
DeepCluster is a simple and scalable method for unsupervised pretraining of visual representations.
We show that DeepCluster convergence and performance depend on the interplay between the quality of the randomly initialized filters of the convolutional layer and the selected number of clusters.
arXiv Detail & Related papers (2022-07-24T22:55:09Z)
- Tight integration of neural- and clustering-based diarization through deep unfolding of infinite Gaussian mixture model [84.57667267657382]
This paper introduces a trainable clustering algorithm into the integration framework.
Speaker embeddings are optimized during training such that they better fit the iGMM clustering.
Experimental results show that the proposed approach outperforms the conventional approach in terms of diarization error rate.
arXiv Detail & Related papers (2022-02-14T07:45:21Z)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker at each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
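
The SHARC entry above describes supervised clustering that merges segments using node densities and edge existence probabilities until convergence. Below is a hedged, framework-agnostic sketch of such a merge loop; `edge_prob`, the threshold `tau`, and the toy scorer are placeholders rather than the papers' trained GNN components, and node densities are omitted for brevity.

```python
# Hedged sketch of a SHARC-style merge loop: repeatedly join pairs of current
# clusters whose predicted edge-existence probability exceeds a threshold,
# until no merge happens.  `edge_prob` stands in for a trained edge classifier;
# `tau` is an assumed decision threshold.
from itertools import combinations

def hierarchical_merge(segments, edge_prob, tau=0.5, max_rounds=10):
    clusters = [[s] for s in segments]            # start: one cluster per segment
    for _ in range(max_rounds):
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if edge_prob(clusters[i], clusters[j]) > tau:
                clusters[i] = clusters[i] + clusters[j]
                del clusters[j]
                merged = True
                break                              # graph changed, re-score pairs
        if not merged:                             # convergence: no pair above tau
            break
    return clusters

# Toy usage with a dummy scorer; in SHARC/E-SHARC the score comes from the GNN.
segs = ["seg0", "seg1", "seg2", "seg3"]
same = {("seg0", "seg1"), ("seg2", "seg3")}
prob = lambda a, b: 1.0 if any((x, y) in same or (y, x) in same for x in a for y in b) else 0.0
print(hierarchical_merge(segs, prob))             # -> [['seg0', 'seg1'], ['seg2', 'seg3']]
```

Restarting the pair scan after each merge keeps the indices valid once a cluster is removed; a real implementation would instead re-run the GNN on the coarsened graph at each level of the hierarchy.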