Multi-Decoder DPRNN: High Accuracy Source Counting and Separation
- URL: http://arxiv.org/abs/2011.12022v2
- Date: Mon, 30 Nov 2020 16:56:04 GMT
- Title: Multi-Decoder DPRNN: High Accuracy Source Counting and Separation
- Authors: Junzhe Zhu, Raymond Yeh, Mark Hasegawa-Johnson
- Abstract summary: We propose an end-to-end trainable approach to single-channel speech separation with an unknown number of speakers.
Our approach extends the MulCat source separation backbone with additional output heads: a count-head to infer the number of speakers, and decoder-heads for reconstructing the original signals.
We demonstrate that our approach outperforms the state of the art in counting the number of speakers and remains competitive in the quality of the reconstructed signals.
- Score: 39.36689677776645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an end-to-end trainable approach to single-channel speech separation with an unknown number of speakers. Our approach extends the MulCat source separation backbone with additional output heads: a count-head to infer the number of speakers, and decoder-heads for reconstructing the original signals. Beyond the model, we also propose a metric for evaluating source separation with a variable number of speakers. Specifically, we clarify how to evaluate quality when the ground truth has more or fewer speakers than the model predicts. We evaluate our approach on the WSJ0-mix datasets, with mixtures of up to five speakers. We demonstrate that our approach outperforms the state of the art in counting the number of speakers and remains competitive in the quality of the reconstructed signals.
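To make the head structure concrete, below is a minimal PyTorch sketch of how a count-head can select among decoder-heads at inference time. It assumes a generic DPRNN-style separator backbone; every module name, layer size, and the batch-level routing shortcut is an illustrative assumption, not the authors' released code.
```python
# Illustrative sketch only: a count-head plus per-count decoder-heads on
# top of a stand-in separator backbone (not the authors' implementation).
import torch
import torch.nn as nn

class MultiDecoderSketch(nn.Module):
    def __init__(self, feat_dim=64, max_speakers=5):
        super().__init__()
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        # Stand-in for the MulCat/DPRNN separator backbone.
        self.backbone = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Count-head: pool over time, then classify the speaker count.
        self.count_head = nn.Linear(feat_dim, max_speakers)
        # Decoder-heads: head c produces c masks over the encoded features.
        self.decoder_heads = nn.ModuleList(
            nn.Conv1d(feat_dim, feat_dim * c, kernel_size=1)
            for c in range(1, max_speakers + 1)
        )
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mix):                          # mix: (B, 1, samples)
        feats = self.backbone(self.encoder(mix))     # (B, F, T)
        count_logits = self.count_head(feats.mean(dim=-1))
        # One count per batch for simplicity; training would instead route
        # each example through the head matching its ground-truth count.
        c = int(count_logits.argmax(dim=-1)[0]) + 1
        masks = self.decoder_heads[c - 1](feats)     # (B, F*c, T)
        B, _, T = masks.shape
        masks = torch.sigmoid(masks.view(B, c, -1, T))
        sources = [self.decoder(masks[:, i] * feats) for i in range(c)]
        return torch.stack(sources, dim=1), count_logits
```
Selecting the decoder-head by the count-head's prediction is what lets one network emit a variable number of output signals.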
Related papers
- Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement [17.645026729525462]
We propose a transformer-based end-to-end model to extract a target speaker's speech from a mixed audio signal.
Our experiments show that using a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves on the CNN baseline by 3.12 dB.
arXiv Detail & Related papers (2024-09-02T16:11:12Z)
- Investigating Confidence Estimation Measures for Speaker Diarization [4.679826697518427]
Speaker diarization systems segment a conversation recording based on the speakers' identity.
Speaker diarization errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity.
One way to mitigate these errors is to provide segment-level diarization confidence scores to downstream systems.
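As a hedged illustration of how such confidence scores might be consumed downstream, the snippet below filters hypothesized diarization segments by a per-segment confidence threshold; the segment fields and the threshold value are assumptions for this sketch, not details from the paper.
```python
# Illustrative only: gate diarization segments on a per-segment
# confidence score before a downstream system consumes them.
# The Segment fields and the 0.8 threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float       # seconds
    end: float         # seconds
    speaker: str
    confidence: float  # in [0, 1]; higher means a more reliable label

def reliable_segments(segments, threshold=0.8):
    """Keep only segments whose speaker label is confident enough."""
    return [s for s in segments if s.confidence >= threshold]

hyp = [Segment(0.0, 3.2, "spk0", 0.95), Segment(3.2, 4.1, "spk1", 0.42)]
print(reliable_segments(hyp))  # drops the low-confidence segment
```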
arXiv Detail & Related papers (2024-06-24T20:21:38Z)
- SepIt: Approaching a Single Channel Speech Separation Bound [99.19786288094596]
We introduce a deep neural network, SepIt, that iteratively improves the estimate of each speaker.
In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
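The iterative idea can be sketched as follows: each pass re-estimates a source given the mixture and the previous estimate. The refinement network here is a deliberately small placeholder, not SepIt's actual architecture.
```python
# Schematic of iterative source refinement (not SepIt's actual model):
# each iteration refines the current estimates given the mixture.
import torch
import torch.nn as nn

class RefineStep(nn.Module):
    """Placeholder refiner: mixture + current estimate -> residual update."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, 3, padding=1),
        )

    def forward(self, mixture, estimate):
        return estimate + self.net(torch.cat([mixture, estimate], dim=1))

def iterative_separation(mixture, init_estimates, step, n_iters=3):
    """mixture: (B, 1, T); init_estimates: list of (B, 1, T), one per speaker."""
    estimates = list(init_estimates)
    for _ in range(n_iters):
        estimates = [step(mixture, e) for e in estimates]
    return estimates
```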
arXiv Detail & Related papers (2022-05-24T05:40:36Z)
- End-to-End Multi-speaker ASR with Independent Vector Analysis [80.83577165608607]
We develop an end-to-end system for multi-channel, multi-speaker automatic speech recognition.
We propose an approach to joint source separation and dereverberation based on the independent vector analysis (IVA) paradigm.
arXiv Detail & Related papers (2022-04-01T05:45:33Z)
- Multi-scale Speaker Diarization with Dynamic Scale Weighting [14.473173007997751]
We propose a more advanced multi-scale diarization system based on a multi-scale diarization decoder.
Our proposed system achieves state-of-the-art performance on the CALLHOME and AMI MixHeadset datasets, with diarization error rates of 3.92% and 1.05%, respectively.
arXiv Detail & Related papers (2022-03-30T01:26:31Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
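A common way to implement such a loss is the EEND-style permutation-invariant binary cross-entropy over frame-level speaker activities, sketched below; the paper's exact variable-number variant may differ from this brute-force version.
```python
# Hedged sketch of a permutation-invariant cross-entropy loss for
# diarization: try every speaker permutation and keep the cheapest.
# Brute-force O(S!) search; fine for small speaker counts.
from itertools import permutations
import torch
import torch.nn.functional as F

def pit_bce(logits, targets):
    """logits, targets: (frames, speakers); targets are float 0/1 activity."""
    n_spk = logits.shape[1]
    best = None
    for perm in permutations(range(n_spk)):
        loss = F.binary_cross_entropy_with_logits(
            logits[:, list(perm)], targets)
        if best is None or loss < best:
            best = loss
    return best
```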
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism [27.19635746008699]
We present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture.
The proposed method is built on an improved multi-channel time-domain speech separation network.
Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline.
arXiv Detail & Related papers (2021-02-07T10:11:49Z)
- Single channel voice separation for unknown number of speakers under reverberant and noisy settings [106.48335929548875]
We present a unified network for voice separation of an unknown number of speakers.
The proposed approach is composed of several separation heads optimized together with a speaker classification branch.
We present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
arXiv Detail & Related papers (2020-11-04T14:59:14Z)
- Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR [91.87500543591945]
We develop an end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers.
Our experiments show very promising performance in counting accuracy, source separation and speech recognition.
Our system generalizes well to a larger number of speakers than it ever saw during training.
arXiv Detail & Related papers (2020-06-04T11:25:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.