Neural Speaker Diarization with Speaker-Wise Chain Rule
- URL: http://arxiv.org/abs/2006.01796v1
- Date: Tue, 2 Jun 2020 17:28:12 GMT
- Title: Neural Speaker Diarization with Speaker-Wise Chain Rule
- Authors: Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi,
Kenji Nagamatsu
- Abstract summary: We propose a speaker-wise conditional inference method for speaker diarization.
We show that the proposed method can correctly produce diarization results with a variable number of speakers.
- Score: 45.60980782843576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker diarization is an essential step for processing multi-speaker audio.
Although an end-to-end neural diarization (EEND) method achieved
state-of-the-art performance, it is limited to a fixed number of speakers. In
this paper, we solve this fixed-number-of-speakers issue by a novel speaker-wise
conditional inference method based on the probabilistic chain rule. In the
proposed method, each speaker's speech activity is regarded as a single random
variable, and is estimated sequentially conditioned on previously estimated
other speakers' speech activities. Similar to other sequence-to-sequence
models, the proposed method produces a variable number of speakers with a stop
sequence condition. We evaluated the proposed method on multi-speaker audio
recordings of a variable number of speakers. Experimental results show that the
proposed method can correctly produce diarization results with a variable
number of speakers and outperforms the state-of-the-art end-to-end speaker
diarization methods in terms of diarization error rate.
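The speaker-wise conditional inference described in the abstract can be read as a chain-rule factorization, P(y_1, ..., y_S | X) = P(y_1 | X) · P(y_2 | y_1, X) · ... · P(y_S | y_1, ..., y_{S-1}, X), where y_s is the frame-wise speech activity of speaker s and X is the audio. The following is a minimal sketch of such a decoding loop, not the authors' implementation. The names `decode_speakers`, `estimate_next`, and `toy_estimator` are hypothetical, and the stop condition (an all-silent estimate ends decoding) is an assumption, since the abstract only says that a stop sequence condition is used.

```python
import numpy as np


def decode_speakers(features, estimate_next, threshold=0.5, max_speakers=8):
    """Estimate one speaker's frame-wise activity at a time.

    `estimate_next(features, prev)` is a hypothetical stand-in for a trained
    network. It receives the input features (T frames) and the activities of
    all previously decoded speakers (T x k) and returns T probabilities.
    """
    num_frames = features.shape[0]
    decoded = []  # one binary activity vector per decoded speaker
    for _ in range(max_speakers):
        prev = np.stack(decoded, axis=1) if decoded else np.zeros((num_frames, 0))
        probs = estimate_next(features, prev)
        activity = (probs > threshold).astype(int)
        # Assumed stop condition: an all-silent estimate means no speaker is left.
        if not activity.any():
            break
        decoded.append(activity)
    if not decoded:
        return np.zeros((num_frames, 0), dtype=int)
    return np.stack(decoded, axis=1)


def toy_estimator(features, prev):
    """Toy stand-in for a model: here the 'features' already hold the
    per-speaker activities, and the estimator returns the next unused column."""
    k = prev.shape[1]
    if k >= features.shape[1]:
        return np.zeros(features.shape[0])
    return features[:, k].astype(float)


# Four frames, two speakers, with overlap in the second frame.
toy_features = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
result = decode_speakers(toy_features, toy_estimator)
print(result.shape)
```

Because the loop ends on the stop condition rather than after a fixed count, the number of output columns follows the recording instead of being set in advance; `max_speakers` is only a safety cap added for this sketch.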
Related papers
- Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS [36.023566245506046]
We propose a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech.
The proposed method uses a sequential line search algorithm that repeatedly asks a user to select a point on a line segment in the embedding space.
Experimental results indicate that the proposed method can achieve comparable performance to the conventional one in objective and subjective evaluations.
arXiv Detail & Related papers (2022-06-21T11:08:05Z) - Coarse-to-Fine Recursive Speech Separation for Unknown Number of
Speakers [8.380514397417457]
This paper formulates the speech separation with the unknown number of speakers as a multi-pass source extraction problem.
Experiments show that the proposed method achieved state-of-the-art performance on the WSJ0 dataset with a different number of speakers.
arXiv Detail & Related papers (2022-03-30T04:45:34Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z) - End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into clusters of the number of speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z) - Joint Speaker Counting, Speech Recognition, and Speaker Identification
for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z) - Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z) - Voice Separation with an Unknown Number of Multiple Speakers [113.91855071999298]
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously.
The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed.
arXiv Detail & Related papers (2020-02-29T20:02:54Z) - End-to-End Neural Diarization: Reformulating Speaker Diarization as
Simple Multi-label Classification [45.38809571153867]
We propose the End-to-End Neural Diarization (EEND) in which a neural network directly outputs speaker diarization results.
By feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations.
arXiv Detail & Related papers (2020-02-24T14:53:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.