MK-SGC-SC: Multiple Kernel Guided Sparse Graph Construction in Spectral Clustering for Unsupervised Speaker Diarization
- URL: http://arxiv.org/abs/2601.19946v2
- Date: Thu, 29 Jan 2026 06:29:30 GMT
- Title: MK-SGC-SC: Multiple Kernel Guided Sparse Graph Construction in Spectral Clustering for Unsupervised Speaker Diarization
- Authors: Nikhil Raghav, Avisek Gupta, Swagatam Das, Md Sahidullah
- Abstract summary: Speaker diarization aims to segment audio recordings into regions corresponding to individual speakers. In this work, we share the notable observation that measuring multiple kernel similarities of speaker embeddings is sufficient to craft a sparse graph for spectral clustering. Experiments show the proposed approach excels in unsupervised speaker diarization over a variety of challenging environments in the DIHARD-III, AMI, and VoxConverse corpora.
- Score: 25.78243411853038
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker diarization aims to segment audio recordings into regions corresponding to individual speakers. Although unsupervised speaker diarization is inherently challenging, the prospect of identifying speaker regions without pretraining or weak supervision motivates research on clustering techniques. In this work, we share the notable observation that measuring multiple kernel similarities of speaker embeddings to thereafter craft a sparse graph for spectral clustering in a principled manner is sufficient to achieve state-of-the-art performances in a fully unsupervised setting. Specifically, we consider four polynomial kernels and a degree one arccosine kernel to measure similarities in speaker embeddings, using which sparse graphs are constructed in a principled manner to emphasize local similarities. Experiments show the proposed approach excels in unsupervised speaker diarization over a variety of challenging environments in the DIHARD-III, AMI, and VoxConverse corpora. To encourage further research, our implementations are available at https://github.com/nikhilraghav29/MK-SGC-SC.
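The pipeline the abstract describes (multiple kernel similarities over speaker embeddings, sparsified into a graph that emphasizes local similarities, then spectrally analysed) can be sketched as follows. The kernel degrees, the plain averaging of kernels, and the top-k sparsification rule are illustrative assumptions, not the paper's exact formulation; see the linked repository for the authors' implementation.

```python
import numpy as np

def arccos_kernel_deg1(cos_sim):
    # Degree-1 arccosine kernel (Cho & Saul) for unit-norm inputs:
    # k(x, y) = (sin t + (pi - t) cos t) / pi, with t = arccos(cos(x, y)).
    t = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    return (np.sin(t) + (np.pi - t) * np.cos(t)) / np.pi

def multi_kernel_affinity(X, degrees=(1, 2, 3, 4)):
    # Unit-normalise embeddings, then average four polynomial kernels with
    # the degree-1 arccosine kernel. Degrees 1..4 are an assumption here.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = Xn @ Xn.T
    kernels = [np.clip(cos, 0.0, None) ** d for d in degrees]
    kernels.append(arccos_kernel_deg1(cos))
    return np.mean(kernels, axis=0)

def sparsify_topk(A, k=2):
    # Keep the k strongest off-diagonal links per node and symmetrise,
    # so the graph encodes only local similarities.
    A = A.copy()
    np.fill_diagonal(A, 0.0)
    W = np.zeros_like(A)
    idx = np.argsort(A, axis=1)[:, -k:]
    rows = np.arange(A.shape[0])[:, None]
    W[rows, idx] = A[rows, idx]
    return np.maximum(W, W.T)

# Toy demo: two well-separated "speakers" in a 2-D embedding space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([5.0, 0.0], 0.1, (4, 2)),
               rng.normal([0.0, 5.0], 0.1, (4, 2))])
W = sparsify_topk(multi_kernel_affinity(X), k=2)
L = np.diag(W.sum(axis=1)) - W          # unnormalised graph Laplacian
evals = np.linalg.eigvalsh(L)
# Two near-zero eigenvalues indicate two connected components,
# i.e. two speaker clusters for the spectral clustering step.
```

In practice the number of near-zero Laplacian eigenvalues (the eigengap) estimates the speaker count, and k-means on the corresponding eigenvectors yields the diarization labels.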
Related papers
- Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC [73.23245793460275]
Multi-talker speech recognition faces unique challenges in disentangling and transcribing overlapping speech. This paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. We propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework.
arXiv Detail & Related papers (2024-09-19T01:26:33Z)
- Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization [7.052822052763606]
This study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization.
We observe that the performance difference between two different domain conditions can be attributed to the role of spectral clustering.
arXiv Detail & Related papers (2024-03-21T10:49:54Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing [22.47152800242178]
Anti-spoofing systems are crucial auxiliaries for automatic speaker verification (ASV) systems.
We propose speaker attractor multi-center one-class learning (SAMO), which clusters bona fide speech around a number of speaker attractors.
Our proposed system outperforms existing state-of-the-art single systems with a relative improvement of 38% on equal error rate (EER) on the ASVspoof 2019 LA evaluation set.
arXiv Detail & Related papers (2022-11-04T19:31:33Z)
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting EEs for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
arXiv Detail & Related papers (2022-04-08T16:27:14Z)
- Diarisation using location tracking with agglomerative clustering [42.13772744221499]
This paper explicitly models the movements of speakers within an Agglomerative Hierarchical Clustering (AHC) diarisation framework.
Experiments show that the proposed approach is able to yield improvements on a Microsoft rich meeting transcription task.
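The AHC backbone this framework builds on can be sketched minimally: repeatedly merge the two closest clusters of segment embeddings until the nearest pair exceeds a stopping threshold. This is plain average linkage for illustration; the paper's location-tracking extension is omitted, and the function and threshold names are assumptions.

```python
import numpy as np

def ahc(segments, stop_dist=1.0):
    # Average-linkage agglomerative hierarchical clustering over segment
    # embeddings. Each cluster is a list of segment indices; merging stops
    # once the closest pair of clusters is farther apart than stop_dist.
    segments = np.asarray(segments, dtype=float)
    clusters = [[i] for i in range(len(segments))]

    def dist(a, b):
        # Average pairwise Euclidean distance between two clusters.
        return np.mean([np.linalg.norm(segments[i] - segments[j])
                        for i in a for j in b])

    while len(clusters) > 1:
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > stop_dist:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two speakers' segments in a toy 1-D embedding space.
print(ahc([[0.0], [0.1], [5.0], [5.2]], stop_dist=1.0))  # → [[0, 1], [2, 3]]
```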
arXiv Detail & Related papers (2021-09-22T08:54:10Z)
- Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors [51.01295414889487]
We introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization.
Our method achieved diarization error rates of 11.84%, 28.33%, and 19.49% on the CALLHOME, DIHARD II, and DIHARD III datasets, respectively.
arXiv Detail & Related papers (2021-07-04T05:34:21Z)
- U-vectors: Generating clusterable speaker embedding from unlabeled data [0.0]
This paper introduces a speaker recognition strategy dealing with unlabeled data.
It generates clusterable embedding vectors from small fixed-size speech frames.
We conclude that the proposed approach achieves remarkable performance using pairwise architectures.
arXiv Detail & Related papers (2021-02-07T18:00:09Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts an activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
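The core TS-VAD idea, a per-frame activity score for each enrolled speaker, can be illustrated with a stand-in similarity model. A real TS-VAD compares frames against i-vector speaker profiles with a trained neural network; the cosine-plus-sigmoid scorer and the temperature of 5 below are assumptions for illustration only.

```python
import numpy as np

def ts_vad_scores(frames, speaker_profiles):
    # For every time frame, score the activity of every enrolled speaker.
    # Stand-in model: cosine similarity between frame embedding and speaker
    # profile, sharpened by a temperature and squashed to a probability.
    F = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    S = speaker_profiles / np.linalg.norm(speaker_profiles, axis=1, keepdims=True)
    logits = F @ S.T                              # (n_frames, n_speakers)
    return 1.0 / (1.0 + np.exp(-5.0 * logits))    # temperature 5 is arbitrary

# Toy frames: speaker 1 alone, speaker 2 alone, then overlapped speech.
frames = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
profiles = np.array([[1.0, 0.0], [0.0, 1.0]])
p = ts_vad_scores(frames, profiles)
active = p > 0.5   # overlap yields multiple active speakers in one frame
```

Because each speaker is scored independently per frame, overlapped speech naturally produces several simultaneously active speakers, which is what makes this formulation attractive for dinner-party-style recordings.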
This list is automatically generated from the titles and abstracts of the papers on this site.