Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario
- URL: http://arxiv.org/abs/2203.09767v1
- Date: Fri, 18 Mar 2022 06:40:39 GMT
- Authors: Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan
- Abstract summary: We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
- Score: 51.5031673695118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we reformulate overlapped speech diarization as a single-label
prediction problem, whereas previous studies have always treated it as a
multi-label classification task. Specifically, the multiple labels of each frame are
encoded into a single label with the power set, which represents the possible
combinations of different speakers. Through this formulation, we propose the
speaker embedding-aware neural diarization (SEND) system. In SEND, the speech
encoder, speaker encoder, similarity scores, and post-processing network are
optimized to predict the power set encoded labels according to the similarities
between speech features and speaker embeddings. Experimental results show that
our method significantly outperforms the variational Bayesian hidden Markov
model-based clustering algorithm (VBx). In addition, the proposed method has two
benefits over target-speaker voice activity detection (TSVAD).
First, SEND achieves lower diarization error rates in real meeting
scenarios. Second, when the training data has a high overlap ratio, the learning
process of SEND is more stable than that of TSVAD.
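The power set encoding described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `max_overlap` cap on simultaneous speakers, and the ordering of class indices are all assumptions made for illustration. Setting `max_overlap` equal to the number of speakers gives the full power set with 2^N classes.

```python
from itertools import combinations


def build_powerset_codebook(num_speakers, max_overlap):
    """Map every speaker combination to one class index.

    Hypothetical helper: combinations are enumerated by size, so index 0
    is silence, followed by single speakers, then pairs, and so on.
    """
    codebook = {}
    for size in range(max_overlap + 1):
        for combo in combinations(range(num_speakers), size):
            codebook[frozenset(combo)] = len(codebook)
    return codebook


def encode_frames(multi_labels, codebook):
    """Turn per-frame multi-hot activity vectors into single class labels.

    Raises KeyError if a frame has more active speakers than the
    codebook was built for.
    """
    encoded = []
    for frame in multi_labels:
        active = frozenset(i for i, flag in enumerate(frame) if flag)
        encoded.append(codebook[active])
    return encoded


def decode_frames(class_ids, codebook, num_speakers):
    """Invert the encoding back to multi-hot activity vectors."""
    inverse = {index: combo for combo, index in codebook.items()}
    return [
        [1 if speaker in inverse[class_id] else 0 for speaker in range(num_speakers)]
        for class_id in class_ids
    ]


# Three speakers, at most two talking at once: 1 + 3 + 3 = 7 classes.
codebook = build_powerset_codebook(num_speakers=3, max_overlap=2)
frames = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]]
labels = encode_frames(frames, codebook)
print(len(codebook), labels)  # 7 [0, 1, 4, 6]
```

With this encoding each frame carries exactly one label, so a network can be trained with an ordinary softmax cross-entropy loss over the combination classes instead of one sigmoid output per speaker.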
Related papers
- Mixture Encoder Supporting Continuous Speech Separation for Meeting
Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
- Bi-LSTM Scoring Based Similarity Measurement with Agglomerative
Hierarchical Clustering (AHC) for Speaker Diarization [0.0]
A typical conversation between two speakers consists of segments where their voices overlap, interrupt each other or halt their speech in between multiple sentences.
Recent advancements in diarization technology leverage neural network-based approaches to improve speaker diarization systems.
We propose a Bi-directional Long Short-term Memory network for estimating the elements present in the similarity matrix.
arXiv Detail & Related papers (2022-05-19T17:20:51Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of
Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Single-channel speech separation using Soft-minimum Permutation
Invariant Training [60.99112031408449]
A long-lasting problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Integrating end-to-end neural and clustering-based diarization: Getting
the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.