Multi-scale speaker embedding-based graph attention networks for speaker
diarisation
- URL: http://arxiv.org/abs/2110.03361v1
- Date: Thu, 7 Oct 2021 11:59:02 GMT
- Title: Multi-scale speaker embedding-based graph attention networks for speaker
diarisation
- Authors: Youngki Kwon, Hee-Soo Heo, Jee-weon Jung, You Jin Kim, Bong-Jin Lee,
Joon Son Chung
- Abstract summary: We propose a graph attention network for multi-scale speaker diarisation.
We design scale indicators to utilise scale information of each embedding.
We adapt the attention-based aggregation to utilise a pre-computed affinity matrix from multi-scale embeddings.
- Score: 30.383712356205084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this work is effective speaker diarisation using multi-scale
speaker embeddings. Typically, there is a trade-off between the ability to
recognise short speaker segments and the discriminative power of the embedding,
according to the segment length used for embedding extraction. To this end,
recent works have proposed the use of multi-scale embeddings where segments
with varying lengths are used. However, the scores are combined using a
weighted summation scheme where the weights are fixed after the training phase,
whereas the importance of segment lengths can differ within a single session.
To address this issue, we present three key contributions in this paper: (1) we
propose graph attention networks for multi-scale speaker diarisation; (2) we
design scale indicators to utilise scale information of each embedding; (3) we
adapt the attention-based aggregation to utilise a pre-computed affinity matrix
from multi-scale embeddings. We demonstrate the effectiveness of our method on
various datasets, where speaker confusion, the primary metric, drops by more
than 10% relative on average compared to the baseline.
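The contrast between a fixed weighted summation of per-scale affinity scores and per-segment attention weights can be sketched as follows. This is a minimal NumPy illustration, not the authors' exact model: the embeddings are random, and a random projection stands in for a learned attention head.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_affinity(emb):
    """Pairwise cosine affinity matrix for one scale's segment embeddings."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T

# Hypothetical setup: 10 segments, 3 temporal scales, 32-dim embeddings.
n_seg, n_scales, dim = 10, 3, 32
embs = [rng.normal(size=(n_seg, dim)) for _ in range(n_scales)]
affinities = np.stack([cosine_affinity(e) for e in embs])  # (scales, seg, seg)

# Fixed-weight baseline: one global weight per scale for the whole session.
fixed_w = np.array([0.5, 0.3, 0.2])
fused_fixed = np.tensordot(fixed_w, affinities, axes=1)    # (seg, seg)

# Attention-style alternative: weights predicted per segment from the
# embeddings themselves (random projection in place of a trained one).
proj = rng.normal(size=(dim, 1))
scores = np.concatenate([e @ proj for e in embs], axis=1)  # (seg, scales)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)                    # softmax per segment
# Row i of the fused matrix mixes the scales with segment i's own weights.
fused_attn = np.einsum('ik,kij->ij', attn, affinities)     # (seg, seg)
```

The key difference is that `fixed_w` is constant for every segment, while `attn` lets each segment choose its own scale mixture, which is what allows the importance of segment lengths to vary within a session.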
Related papers
- Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios [0.9094127664014627]
End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap.
This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities.
arXiv Detail & Related papers (2024-07-01T14:26:28Z) - Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z) - Multi-scale Speaker Diarization with Dynamic Scale Weighting [14.473173007997751]
We propose a more advanced multi-scale diarization system based on a multi-scale diarization decoder.
Our proposed system achieves state-of-the-art performance on the CALLHOME and AMI MixHeadset datasets, with 3.92% and 1.05% diarization error rates, respectively.
arXiv Detail & Related papers (2022-03-30T01:26:31Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z) - End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z) - End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z) - Single channel voice separation for unknown number of speakers under
reverberant and noisy settings [106.48335929548875]
We present a unified network for voice separation of an unknown number of speakers.
The proposed approach is composed of several separation heads optimized together with a speaker classification branch.
We present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
arXiv Detail & Related papers (2020-11-04T14:59:14Z) - Graph Attention Networks for Speaker Verification [43.01058120303278]
This work presents a novel back-end framework for speaker verification using graph attention networks.
We first construct a graph using segment-wise speaker embeddings and then input these to graph attention networks.
After a few graph attention layers with residual connections, each node is projected into a one-dimensional space using an affine transform.
arXiv Detail & Related papers (2020-10-22T09:08:02Z) - Speaker diarization with session-level speaker embedding refinement
using graph neural networks [26.688724154619504]
We present the first use of graph neural networks (GNNs) for the speaker diarization problem, utilizing a GNN to refine speaker embeddings locally.
The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space, in which the different speakers within a single session are better separated.
We show that the clustering performance of the refined speaker embeddings outperforms the original embeddings significantly on both simulated and real meeting data.
arXiv Detail & Related papers (2020-05-22T19:52:51Z) - Weakly Supervised Training of Hierarchical Attention Networks for
Speaker Identification [37.33388614967888]
A hierarchical attention network is proposed to solve a weakly labelled speaker identification problem.
The use of a hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker related information locally and globally.
arXiv Detail & Related papers (2020-05-15T22:57:53Z)
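The graph-attention back-end recurring in the papers above (segment embeddings as nodes, a few attention layers with residual connections, then a 1-D affine projection per node) can be sketched as below. This is an illustrative NumPy toy, not any of the cited architectures: random matrices stand in for learned weights, and the graph is fully connected.

```python
import numpy as np

rng = np.random.default_rng(1)

def gat_layer(x, w, a_src, a_dst):
    """One simplified graph attention layer over a fully connected graph,
    with a residual connection (GAT-style additive attention logits)."""
    h = x @ w                                            # shared linear map
    logits = (h @ a_src)[:, None] + (h @ a_dst)[None, :] # e_ij per node pair
    logits = np.where(logits > 0, logits, 0.2 * logits)  # LeakyReLU
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)              # softmax over neighbours
    return x + attn @ h                                  # residual connection

# Hypothetical input: 8 segment-wise speaker embeddings of dimension 16.
n_seg, dim = 8, 16
x = rng.normal(size=(n_seg, dim))

# A few stacked graph attention layers.
for _ in range(2):
    w = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    x = gat_layer(x, w, rng.normal(size=dim), rng.normal(size=dim))

# Final affine transform: project each node into a one-dimensional space.
node_scores = x @ rng.normal(size=(dim, 1))              # (n_seg, 1)
```

In a real back-end, `w`, `a_src`, `a_dst`, and the final projection would be trained, and `node_scores` would feed a verification or clustering decision.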
This list is automatically generated from the titles and abstracts of the papers in this site.