BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a
Variable Number of Speakers
- URL: http://arxiv.org/abs/2011.02678v2
- Date: Fri, 12 Feb 2021 18:21:17 GMT
- Title: BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a
Variable Number of Speakers
- Authors: Eunjung Han, Chul Lee, Andreas Stolcke
- Abstract summary: We present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers.
For unlimited-latency BW-EDA-EEND, we show only moderate degradation for up to two speakers using a context size of 10 seconds compared to offline EDA-EEND.
For limited-latency BW-EDA-EEND, which produces diarization outputs block-by-block as audio arrives, we show accuracy comparable to the offline clustering-based system.
- Score: 20.22005716662987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel online end-to-end neural diarization system, BW-EDA-EEND,
that processes data incrementally for a variable number of speakers. The system
is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et
al., but utilizes the incremental Transformer encoder, attending only to its
left contexts and using block-level recurrence in the hidden states to carry
information from block to block, making the algorithm complexity linear in
time. We propose two variants: For unlimited-latency BW-EDA-EEND, which
processes inputs in linear time, we show only moderate degradation for up to
two speakers using a context size of 10 seconds compared to offline EDA-EEND.
With more than two speakers, the accuracy gap between online and offline grows,
but the algorithm still outperforms a baseline offline clustering diarization
system for one to four speakers with unlimited context size, and shows
comparable accuracy with context size of 10 seconds. For limited-latency
BW-EDA-EEND, which produces diarization outputs block-by-block as audio
arrives, we show accuracy comparable to the offline clustering-based system.
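The block-wise mechanism described above (attending only to left context, carrying hidden states from block to block, linear time overall) can be illustrated with a toy sketch. This is my own NumPy illustration, not the authors' implementation; the function names, block size, and single-head attention are assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    # Single-head scaled dot-product attention (illustrative only).
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

def stream_blocks(frames, block_len=10, context_blocks=1):
    """Process frames block by block; each block attends only to itself
    plus a bounded cache of previous block outputs (its left context),
    so total cost grows linearly with the number of blocks."""
    cache = []       # hidden states carried over from earlier blocks
    outputs = []
    for start in range(0, len(frames), block_len):
        block = frames[start:start + block_len]
        ctx = np.concatenate(cache + [block]) if cache else block
        h = attend(block, ctx)           # left context + current block only
        outputs.append(h)
        cache.append(h)
        cache = cache[-context_blocks:]  # keep a fixed-size left context
    return np.concatenate(outputs)

x = rng.standard_normal((35, 8))   # 35 toy frames of dimension 8
y = stream_blocks(x)
print(y.shape)
```

Because the cache is truncated to a fixed number of blocks, memory and per-block compute stay bounded as audio length grows, which is what makes the limited-latency variant feasible.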
Related papers
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- Online Neural Diarization of Unlimited Numbers of Speakers [34.465500195087]
A method to perform speaker diarization for an unlimited number of speakers is described in this paper.
The number of output speakers of attractor-based EEND is empirically capped.
EEND-GLA solves this problem by introducing unsupervised clustering into attractor-based EEND.
arXiv Detail & Related papers (2022-06-06T08:48:26Z)
- A neural network-supported two-stage algorithm for lightweight dereverberation on hearing devices [13.49645012479288]
A two-stage lightweight online dereverberation algorithm for hearing devices is presented in this paper.
The approach combines a multi-channel multi-frame linear filter with a single-channel single-frame post-filter.
Both components rely on power spectral density (PSD) estimates provided by deep neural networks (DNNs).
arXiv Detail & Related papers (2022-04-06T11:08:28Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Efficient Autoprecoder-based deep learning for massive MU-MIMO Downlink under PA Non-Linearities [0.0]
We present AP-mMIMO, a new method that jointly eliminates the multiuser interference and compensates the severe nonlinear (NL) PA distortions.
Unlike previous works, AP-mMIMO has a low computational complexity, making it suitable for a global energy-efficient system.
arXiv Detail & Related papers (2022-02-03T08:53:52Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT can achieve better accuracy compared with PIT.
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
- Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
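The clustering-based half of such a hybrid can be sketched as follows: per-segment speaker embeddings (e.g. x-vectors) are grouped by agglomerative clustering under a cosine-distance threshold. This is a naive illustrative implementation of my own, not the paper's method; the threshold value and average-linkage choice are assumptions.

```python
import numpy as np

def cosine_dist(a, b):
    return 1.0 - (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))

def ahc(embeddings, threshold=0.3):
    """Naive average-linkage agglomerative clustering by cosine distance."""
    clusters = [[i] for i in range(len(embeddings))]
    centroids = [embeddings[i].astype(float).copy() for i in range(len(embeddings))]
    while len(clusters) > 1:
        # Find the closest pair of cluster centroids.
        best, pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cosine_dist(centroids[i], centroids[j])
                if d < best:
                    best, pair = d, (i, j)
        if best > threshold:   # stop merging once clusters are dissimilar
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
        centroids[i] = embeddings[clusters[i]].mean(axis=0)
        centroids.pop(j)
    labels = np.empty(len(embeddings), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Two well-separated toy "speakers", three segments each.
emb = np.vstack([np.tile([1.0, 0.0], (3, 1)) + 0.01 * np.arange(3)[:, None],
                 np.tile([0.0, 1.0], (3, 1))])
labels = ahc(emb)
print(labels)
```

The number of clusters falls out of the distance threshold rather than being fixed in advance, which is the property that lets clustering handle an arbitrary number of speakers.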
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed by simulation or by replaying of the Lip Reading Sentences 2 dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve word error rates (WERs) similar to those of i-vectors for single-speaker utterances and significantly lower WERs for utterances containing speaker changes.
arXiv Detail & Related papers (2020-02-14T18:31:31Z)