Multi-scale Speaker Diarization with Dynamic Scale Weighting
- URL: http://arxiv.org/abs/2203.15974v1
- Date: Wed, 30 Mar 2022 01:26:31 GMT
- Title: Multi-scale Speaker Diarization with Dynamic Scale Weighting
- Authors: Tae Jin Park, Nithin Rao Koluguri, Jagadeesh Balam and Boris Ginsburg
- Abstract summary: We propose a more advanced multi-scale diarization system based on a multi-scale diarization decoder.
Our proposed system achieves state-of-the-art performance on the CALLHOME and AMI MixHeadset datasets, with 3.92% and 1.05% diarization error rates, respectively.
- Score: 14.473173007997751
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker diarization systems are challenged by a trade-off between the
temporal resolution and the fidelity of the speaker representation. A
multi-scale approach is a way to cope with this trade-off, obtaining superior
temporal resolution with enhanced accuracy. In this paper, we
propose a more advanced multi-scale diarization system based on a multi-scale
diarization decoder. There are two main contributions in this study that
significantly improve the diarization performance. First, we use multi-scale
clustering as an initialization to estimate the number of speakers and obtain
the average speaker representation vector for each speaker and each scale.
Next, we propose the use of 1-D convolutional neural networks that dynamically
determine the importance of each scale at each time step. To handle a variable
number of speakers and overlapping speech, the proposed system can estimate the
number of existing speakers. Our proposed system achieves state-of-the-art
performance on the CALLHOME and AMI MixHeadset datasets, with 3.92% and 1.05%
diarization error rates, respectively.
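For intuition, here is a minimal PyTorch sketch of the dynamic scale-weighting idea described in the abstract: a 1-D convolutional network over the time axis produces softmax weights over the scales at each step, and those weights fuse per-scale cosine similarities between segment embeddings and the cluster-average speaker embeddings obtained from the multi-scale clustering initialization. This is not the authors' MSDD implementation; the class name `MultiScaleWeightingSketch`, the tensor shapes, the layer sizes, and the direct sigmoid fusion at the end are illustrative assumptions.

```python
# Minimal sketch of dynamic scale weighting for multi-scale diarization.
# Assumptions (not from the paper's code): shapes, layer sizes, and the
# sigmoid fusion step are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleWeightingSketch(nn.Module):
    def __init__(self, num_scales: int, emb_dim: int, hidden: int = 64):
        super().__init__()
        # 1-D convolution over the time axis: input channels carry the
        # flattened per-scale segment embeddings, output channels give one
        # unnormalized weight per scale at each time step.
        self.scale_cnn = nn.Sequential(
            nn.Conv1d(num_scales * emb_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, num_scales, kernel_size=3, padding=1),
        )

    def forward(self, seg_emb: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        """
        seg_emb: (batch, time, num_scales, emb_dim)  per-scale segment embeddings
        spk_emb: (batch, num_speakers, num_scales, emb_dim)
                 cluster-average speaker embeddings from the initialization step
        returns: (batch, time, num_speakers) speaker activity probabilities
        """
        b, t, s, d = seg_emb.shape

        # Per-step, per-scale weights from the 1-D CNN: (batch, time, num_scales)
        w = self.scale_cnn(seg_emb.reshape(b, t, s * d).transpose(1, 2))
        w = torch.softmax(w.transpose(1, 2), dim=-1)

        # Cosine similarity between each segment and each speaker at each scale:
        # (batch, time, num_speakers, num_scales)
        sim = F.cosine_similarity(
            seg_emb.unsqueeze(2),   # (b, t, 1, s, d)
            spk_emb.unsqueeze(1),   # (b, 1, spk, s, d)
            dim=-1,
        )

        # Fuse the scales with the dynamic weights; a per-speaker sigmoid
        # (rather than a softmax over speakers) allows overlapping speech.
        fused = (sim * w.unsqueeze(2)).sum(dim=-1)
        return torch.sigmoid(fused)


if __name__ == "__main__":
    # Toy shapes: 2 recordings, 50 steps, 3 scales, 192-dim embeddings, 4 speakers.
    model = MultiScaleWeightingSketch(num_scales=3, emb_dim=192)
    seg = torch.randn(2, 50, 3, 192)
    spk = torch.randn(2, 4, 3, 192)
    print(model(seg, spk).shape)  # torch.Size([2, 50, 4])
```

The sigmoid output reflects the overlap-aware design mentioned in the abstract, where more than one speaker can be active at a given time step; the paper's full decoder adds further modeling on top of this scale-weighted fusion.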
Related papers
- Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios [0.9094127664014627]
End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap.
This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities.
arXiv Detail & Related papers (2024-07-01T14:26:28Z)
- High-resolution embedding extractor for speaker diarisation [15.392429990363492]
This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE)
HEE consists of a feature-map extractor and an enhancer, where the enhancer with the self-attention mechanism is the key to its success.
Through experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates an improvement of at least 10% on each evaluation set.
arXiv Detail & Related papers (2022-11-08T07:41:18Z)
- Generation of Speaker Representations Using Heterogeneous Training Batch Assembly [16.534380339042087]
We propose a new CNN-based speaker modeling scheme.
We randomly and synthetically augment the training data into a set of segments.
A soft label is imposed on each segment based on its speaker occupation ratio.
arXiv Detail & Related papers (2022-03-30T19:59:05Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Multi-scale speaker embedding-based graph attention networks for speaker diarisation [30.383712356205084]
We propose a graph attention network for multi-scale speaker diarisation.
We design scale indicators to utilise scale information of each embedding.
We adapt the attention-based aggregation to utilise a pre-computed affinity matrix from multi-scale embeddings.
arXiv Detail & Related papers (2021-10-07T11:59:02Z)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Single channel voice separation for unknown number of speakers under reverberant and noisy settings [106.48335929548875]
We present a unified network for voice separation of an unknown number of speakers.
The proposed approach is composed of several separation heads optimized together with a speaker classification branch.
We present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
arXiv Detail & Related papers (2020-11-04T14:59:14Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker at each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)