End-to-End Neural Diarization: Reformulating Speaker Diarization as
Simple Multi-label Classification
- URL: http://arxiv.org/abs/2003.02966v1
- Date: Mon, 24 Feb 2020 14:53:32 GMT
- Title: End-to-End Neural Diarization: Reformulating Speaker Diarization as
Simple Multi-label Classification
- Authors: Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Kenji
Nagamatsu
- Abstract summary: We propose the End-to-End Neural Diarization (EEND) in which a neural network directly outputs speaker diarization results.
By feeding it multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations.
- Score: 45.38809571153867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The most common approach to speaker diarization is clustering of speaker
embeddings. However, the clustering-based approach has a number of problems:
(i) it is not optimized to minimize diarization errors directly, (ii) it
cannot handle speaker overlaps correctly, and (iii) it has trouble adapting
its speaker embedding models to real audio recordings with speaker overlaps.
To solve these problems, we propose the End-to-End Neural Diarization (EEND),
in which a neural network directly outputs speaker diarization results given a
multi-speaker recording. To realize such an end-to-end model, we formulate the
speaker diarization problem as a multi-label classification problem and
introduce a permutation-free objective function to directly minimize
diarization errors. Besides its end-to-end simplicity, the EEND method can
explicitly handle speaker overlaps during training and inference. Just by
feeding it multi-speaker recordings with corresponding speaker segment labels,
our model can be easily adapted to real conversations. We evaluated our method on
simulated speech mixtures and real conversation datasets. The results showed
that the EEND method outperformed the state-of-the-art x-vector
clustering-based method while correctly handling speaker overlaps. We
explored the neural network architecture for the EEND method, and found that
the self-attention-based neural network was the key to achieving excellent
performance. In contrast to a bidirectional long short-term memory (BLSTM)
network, which conditions each frame only on the previous and next hidden
states, self-attention conditions each frame directly on all the frames. By
visualizing the attention weights, we show that self-attention captures global
speaker characteristics in addition to local speech activity dynamics, making
it especially suitable for dealing with the speaker diarization problem.
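To make the formulation concrete, here is a minimal sketch of a permutation-free multi-label objective in PyTorch. The function name, tensor shapes, and unbatched per-recording setup are our assumptions for illustration, not the authors' code:

```python
import itertools

import torch
import torch.nn.functional as F


def permutation_free_bce(logits, labels):
    """Sketch of a permutation-free objective: evaluate the frame-wise
    binary cross-entropy under every permutation of the reference
    speaker columns and keep the minimum, so the network is not
    penalized for ordering speakers differently from the reference.

    logits: (T, S) raw per-frame activity scores for S speakers
    labels: (T, S) 0/1 speaker-activity targets
    """
    labels = labels.float()
    n_speakers = labels.shape[1]
    losses = [
        F.binary_cross_entropy_with_logits(logits, labels[:, list(perm)])
        for perm in itertools.permutations(range(n_speakers))
    ]
    # The best-matching permutation defines the loss (and, implicitly,
    # the label assignment used for that training step).
    return torch.stack(losses).min()
```

Enumerating permutations costs S! loss evaluations, which stays cheap when the number of speakers per recording is small.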
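The abstract credits the self-attention encoder for the strong results, since every frame attends to every other frame rather than only to adjacent hidden states as in a BLSTM. Below is a minimal sketch of such an encoder built from stock PyTorch Transformer layers; the class name, feature dimension, model width, and layer counts are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn


class SelfAttentiveDiarizer(nn.Module):
    """EEND-style encoder sketch: per-frame, per-speaker activity
    logits produced by a self-attention stack (illustrative sizes)."""

    def __init__(self, n_feats=40, d_model=256, n_heads=4,
                 n_layers=2, n_speakers=2):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_speakers)

    def forward(self, x):
        # x: (batch, T, n_feats) acoustic features; every frame can
        # attend to all frames, capturing global speaker characteristics.
        h = self.encoder(self.proj(x))
        return self.head(h)  # (batch, T, n_speakers) logits
```

The two sketches pair naturally: `loss = permutation_free_bce(model(x)[0], labels)` during training, with a sigmoid over the logits at inference time to obtain per-speaker activity probabilities.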
Related papers
- Online speaker diarization of meetings guided by speech separation [0.0]
Overlapped speech is notoriously problematic for speaker diarization systems.
We introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings.
arXiv Detail & Related papers (2024-01-30T09:09:22Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set-encoded labels (see the sketch after this list).
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Guided Training: A Simple Method for Single-channel Speaker Separation [40.34570426165019]
We propose a strategy to train a long short-term memory (LSTM) model to solve the permutation problem in speaker separation.
Owing to its powerful sequence-modeling capability, an LSTM can use its memory cells to track and separate target speech from interfering speech.
arXiv Detail & Related papers (2021-03-26T08:46:50Z)
- Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that handles overlapped speech and long recordings containing an arbitrary number of speakers.
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
- Speaker diarization with session-level speaker embedding refinement using graph neural networks [26.688724154619504]
We present the first use of graph neural networks (GNNs) for the speaker diarization problem, utilizing a GNN to refine speaker embeddings locally.
The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space, in which the different speakers within a single session are better separated.
We show that clustering with the refined speaker embeddings significantly outperforms clustering with the original embeddings on both simulated and real meeting data.
arXiv Detail & Related papers (2020-05-22T19:52:51Z)
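For the power-set-encoded labels named in the SEND entries above, here is a small illustrative sketch (our construction, not that system's code): each subset of simultaneously active speakers becomes one class, so frame-wise multi-label activity over S speakers turns into single-label classification over 2**S classes.

```python
from itertools import combinations


def powerset_label_map(n_speakers):
    """Map every subset of active speakers to a single class index
    (illustration of power-set label encoding, not SEND's code)."""
    subsets = []
    for n_active in range(n_speakers + 1):
        subsets.extend(combinations(range(n_speakers), n_active))
    return {subset: idx for idx, subset in enumerate(subsets)}


# Two speakers yield four classes: silence, speaker 0, speaker 1, overlap.
print(powerset_label_map(2))  # {(): 0, (0,): 1, (1,): 2, (0, 1): 3}
```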