Coarse-to-Fine Recursive Speech Separation for Unknown Number of
Speakers
- URL: http://arxiv.org/abs/2203.16054v1
- Date: Wed, 30 Mar 2022 04:45:34 GMT
- Title: Coarse-to-Fine Recursive Speech Separation for Unknown Number of
Speakers
- Authors: Zhenhao Jin, Xiang Hao and Xiangdong Su
- Abstract summary: This paper formulates the speech separation with the unknown number of speakers as a multi-pass source extraction problem.
Experiments show that the proposed method achieved state-of-the-art performance on the WSJ0 dataset with different numbers of speakers.
- Score: 8.380514397417457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The vast majority of speech separation methods assume that the number of
speakers is known in advance, hence they are specific to the number of
speakers. By contrast, a more realistic and challenging task is to separate a
mixture in which the number of speakers is unknown. This paper formulates the
speech separation with the unknown number of speakers as a multi-pass source
extraction problem and proposes a coarse-to-fine recursive speech separation
method. This method comprises two stages, namely, recursive cue extraction and
target speaker extraction. The recursive cue extraction stage determines how
many computational iterations need to be performed and outputs a coarse cue
speech by monitoring statistics in the mixture. As the number of recursive
iterations increases, distortion accumulates in both the extracted speech and
the residual. Therefore, in the second stage, we use a target speaker
extraction network to extract a refined speech signal based on the coarse
target cue and the original, distortionless mixture. Experiments show that the
proposed method achieved state-of-the-art performance on the WSJ0 dataset with
different numbers of speakers. Furthermore, it generalizes well to an unseen
larger number of speakers.
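The two-stage procedure described in the abstract can be sketched as a multi-pass extraction loop. This is a hypothetical illustration, not the paper's implementation: `cue_extractor`, `speaker_extractor`, and the energy-based stopping rule are stand-ins for the paper's recursive cue-extraction network, target-speaker-extraction network, and mixture-statistics monitor.

```python
# Hedged sketch of coarse-to-fine recursive separation for an unknown
# number of speakers. All function names and the stopping criterion are
# illustrative assumptions, not the authors' actual networks.
import numpy as np

def recursive_separate(mixture, cue_extractor, speaker_extractor,
                       energy_threshold=1e-3, max_speakers=10):
    """Multi-pass source extraction with an energy-based stopping rule."""
    residual = mixture.copy()
    coarse_cues = []
    # Stage 1: recursively peel off one coarse cue per pass; stop when
    # the residual energy suggests no speaker remains in the mixture.
    for _ in range(max_speakers):
        if np.mean(residual ** 2) < energy_threshold:
            break
        cue = cue_extractor(residual)
        coarse_cues.append(cue)
        residual = residual - cue  # distortion accumulates here over passes
    # Stage 2: refine each coarse cue against the original, distortionless
    # mixture, so the accumulated distortion does not reach the final output.
    return [speaker_extractor(mixture, cue) for cue in coarse_cues]
```

Because the number of passes is decided at run time, the same model handles mixtures with any speaker count; the second stage always conditions on the untouched input mixture rather than the distorted residual.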
Related papers
- End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z) - End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z) - Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent
Speech Separation [7.453268060082337]
We propose deep ad-hoc beamforming based on speaker extraction, which is to our knowledge the first work for target-dependent speech separation based on ad-hoc microphone arrays and deep learning.
Experimental results demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2020-12-01T11:06:36Z) - Single channel voice separation for unknown number of speakers under
reverberant and noisy settings [106.48335929548875]
We present a unified network for voice separation of an unknown number of speakers.
The proposed approach is composed of several separation heads optimized together with a speaker classification branch.
We present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
arXiv Detail & Related papers (2020-11-04T14:59:14Z) - Multi-talker ASR for an unknown number of sources: Joint training of
source counting, separation and ASR [91.87500543591945]
We develop an end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers.
Our experiments show very promising performance in counting accuracy, source separation and speech recognition.
Our system generalizes well to a larger number of speakers than it ever saw during training.
arXiv Detail & Related papers (2020-06-04T11:25:50Z) - Neural Speaker Diarization with Speaker-Wise Chain Rule [45.60980782843576]
We propose a speaker-wise conditional inference method for speaker diarization.
We show that the proposed method can correctly produce diarization results with a variable number of speakers.
arXiv Detail & Related papers (2020-06-02T17:28:12Z) - SpEx: Multi-Scale Time Domain Speaker Extraction Network [89.00319878262005]
Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment.
It is common to perform the extraction in frequency-domain, and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra.
We propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.
arXiv Detail & Related papers (2020-04-17T16:13:06Z) - Voice Separation with an Unknown Number of Multiple Speakers [113.91855071999298]
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously.
The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed.
arXiv Detail & Related papers (2020-02-29T20:02:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.