Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds
- URL: http://arxiv.org/abs/2010.13366v2
- Date: Fri, 5 Feb 2021 02:34:07 GMT
- Title: Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds
- Authors: Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara
- Abstract summary: Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
- Score: 71.36164750147827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent diarization technologies can be categorized into two approaches, i.e., clustering and end-to-end neural approaches, which have different pros and cons. Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors. While clustering can be regarded as the current state-of-the-art approach, working on various challenging data with reasonable robustness and accuracy, it has a critical disadvantage: it cannot handle the overlapped speech that is inevitable in natural conversational data. In contrast, end-to-end neural diarization (EEND), which directly predicts diarization labels using a neural network, was devised to handle overlapped speech. While EEND, which can easily incorporate emerging deep-learning technologies, has started outperforming the x-vector clustering approach on some realistic databases, it is difficult to make it work for "long" recordings (e.g., recordings longer than 10 minutes) because of, for example, its huge memory consumption. Block-wise independent processing is also difficult because it poses an inter-block label permutation problem, i.e., an ambiguity in the speaker-label assignments across blocks. In this paper, we propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers. It modifies the conventional EEND framework to simultaneously output global speaker embeddings so that speaker clustering can be performed across blocks to solve the permutation problem. With experiments based on simulated noisy reverberant 2-speaker meeting-like data, we show that the proposed framework works significantly better than the original EEND, especially when the input recording is long.
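To make the proposed block-wise idea concrete, here is a minimal Python sketch under assumed interfaces: an EEND-style model (`eend_block`, a placeholder callable) is run block by block, each block also emits one speaker embedding per local speaker slot, and clustering those embeddings across all blocks resolves the inter-block permutation ambiguity. The shapes, block length, and clustering settings are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: block-wise diarization with cross-block speaker clustering.
# `eend_block` is an assumed callable, not a real released model interface.
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def diarize_long_recording(features, eend_block, block_len=1000, n_speakers=2):
    """Run an EEND-style model block by block, then cluster the per-block
    speaker embeddings so that speaker labels stay consistent across blocks."""
    activities, embeddings, owners = [], [], []
    for b, start in enumerate(range(0, len(features), block_len)):
        block = features[start:start + block_len]
        # Assumed model interface: per-frame activities (frames, slots) and
        # one embedding per local speaker slot (slots, dim), produced jointly.
        act, emb = eend_block(block)
        activities.append(act)
        embeddings.append(emb)
        owners.extend((b, slot) for slot in range(emb.shape[0]))

    # Cluster all block-level embeddings globally; the cluster index becomes
    # the global speaker identity, resolving the inter-block permutation.
    all_emb = np.concatenate(embeddings, axis=0)
    labels = AgglomerativeClustering(
        n_clusters=n_speakers, metric="cosine", linkage="average"
    ).fit_predict(all_emb)

    # Remap each block's local speaker slots to the global cluster labels.
    result = np.zeros((len(features), n_speakers))
    for (b, slot), spk in zip(owners, labels):
        start = b * block_len
        act = activities[b][:, slot]
        result[start:start + len(act), spk] = np.maximum(
            result[start:start + len(act), spk], act
        )
    return result  # (frames, n_speakers) speech-activity matrix
```

In practice the number of speakers would typically be estimated from the clustering itself rather than fixed in advance; the sketch fixes `n_speakers` only to keep it short.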
Related papers
- From Modular to End-to-End Speaker Diarization [3.079020586262228]
We describe a system based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), known as VBx.
We describe an approach for generating synthetic data which resembles real conversations in terms of speaker turns and overlaps.
We show how this method for generating "simulated conversations" allows for better performance than a previously proposed method for creating "simulated mixtures" when training the popular EEND.
arXiv Detail & Related papers (2024-06-27T15:09:39Z)
- WavThruVec: Latent speech representation as intermediate features for neural speech synthesis [1.1470070927586016]
WavThruVec is a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as intermediate speech representation.
We show that the proposed model not only matches the quality of state-of-the-art neural models, but also presents useful properties enabling tasks like voice conversion or zero-shot synthesis.
arXiv Detail & Related papers (2022-03-31T10:21:08Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Single-channel Speech Separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution to this label-ambiguity problem; a minimal PIT-loss sketch appears after this list.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification [45.38809571153867]
We propose End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results.
By feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations.
arXiv Detail & Related papers (2020-02-24T14:53:32Z)
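As referenced in the Soft-minimum PIT entry above, the sketch below shows the standard exhaustive-permutation form of a permutation-invariant training loss for a small, fixed number of speakers. It is a generic PyTorch illustration (the function name `pit_bce_loss` is hypothetical), not code from any of the papers listed.

```python
# Sketch only: exhaustive-permutation PIT loss for multi-speaker targets.
from itertools import permutations

import torch
import torch.nn.functional as F


def pit_bce_loss(pred, target):
    """pred, target: (batch, frames, speakers). Returns the minimum
    binary cross-entropy over all speaker-label permutations."""
    n_spk = pred.shape[-1]
    losses = []
    for perm in permutations(range(n_spk)):
        # Reorder the reference speakers and score against the predictions.
        permuted = target[..., list(perm)]
        losses.append(
            F.binary_cross_entropy_with_logits(
                pred, permuted, reduction="none"
            ).mean(dim=(1, 2))
        )
    # Pick, per utterance, the permutation with the smallest loss.
    return torch.stack(losses, dim=0).min(dim=0).values.mean()
```

Because the search runs over all speaker permutations, its cost grows factorially with the number of speakers, which is the inefficiency the Soft-minimum PIT work targets.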