UNSSOR: Unsupervised Neural Speech Separation by Leveraging
Over-determined Training Mixtures
- URL: http://arxiv.org/abs/2305.20054v2
- Date: Sun, 29 Oct 2023 14:55:12 GMT
- Title: UNSSOR: Unsupervised Neural Speech Separation by Leveraging
Over-determined Training Mixtures
- Authors: Zhong-Qiu Wang and Shinji Watanabe
- Abstract summary: In reverberant conditions, each microphone acquires a mixture signal of multiple speakers at a different location.
We propose UNSSOR, an algorithm for unsupervised neural speech separation by leveraging over-determined training mixtures.
We show that a loss encouraging the filtered speaker estimates to add up to each mixture can promote unsupervised separation of speakers.
- Score: 60.879679764741624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In reverberant conditions with multiple concurrent speakers, each microphone
acquires a mixture signal of multiple speakers at a different location. In
over-determined conditions where the microphones outnumber the speakers, we can
narrow down the solutions to speaker images and realize unsupervised speech
separation by leveraging each mixture signal as a constraint (i.e., the
estimated speaker images at a microphone should add up to the mixture).
Equipped with this insight, we propose UNSSOR, an algorithm for
$\textbf{u}$nsupervised $\textbf{n}$eural $\textbf{s}$peech
$\textbf{s}$eparation by leveraging $\textbf{o}$ver-determined training
mixtu$\textbf{r}$es. At each training step, we feed an input mixture to a deep
neural network (DNN) to produce an intermediate estimate for each speaker,
linearly filter the estimates, and optimize a loss so that, at each microphone,
the filtered estimates of all the speakers can add up to the mixture to satisfy
the above constraint. We show that this loss can promote unsupervised
separation of speakers. The linear filters are computed in each sub-band based
on the mixture and DNN estimates through the forward convolutive prediction
(FCP) algorithm. To address the frequency permutation problem incurred by using
sub-band FCP, a loss term based on minimizing intra-source magnitude scattering
is proposed. Although UNSSOR requires over-determined training mixtures, we can
train DNNs to achieve under-determined separation (e.g., unsupervised monaural
speech separation). Evaluation results on two-speaker separation in reverberant
conditions show the effectiveness and potential of UNSSOR.
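To make the training objective above concrete, here is a minimal PyTorch sketch of the mixture-constraint loss with per-frequency FCP filtering. The filter length, the FCP weighting, the loss weight, and the stand-in for the intra-source magnitude-scattering term are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def fcp_filter(mix_f, est_f, taps=20, eps=1e-4):
    """Per-frequency forward convolutive prediction (FCP), sketched.

    mix_f: (T,) complex STFT of the mixture at one mic and one frequency.
    est_f: (T,) complex DNN estimate of one speaker at that frequency.
    Returns the linearly filtered estimate (T,) that best explains mix_f.
    """
    T = est_f.shape[0]
    pad = torch.cat([torch.zeros(taps - 1, dtype=est_f.dtype), est_f])
    # Column d holds the estimate delayed by d frames (current + past taps).
    A = torch.stack([pad[taps - 1 - d: taps - 1 - d + T] for d in range(taps)], dim=1)
    # Mixture-magnitude weighting, a common choice with FCP (a simplification here).
    w = 1.0 / (mix_f.abs() ** 2 + eps)
    Aw = A * w.unsqueeze(1)
    # Weighted least squares: g = (A^H W A)^{-1} A^H W y.
    g = torch.linalg.solve(A.conj().T @ Aw, (Aw.conj().T @ mix_f).unsqueeze(1))
    return (A @ g).squeeze(1)

def unssor_loss(mix, est, taps=20, gamma=0.1):
    """Mixture-constraint loss: at every microphone, the FCP-filtered speaker
    estimates should add up to the observed mixture.

    mix: (P, F, T) complex STFTs of the P microphone signals.
    est: (C, F, T) complex STFTs of the C speaker estimates from the DNN.
    """
    P, F, T = mix.shape
    C = est.shape[0]
    mc = 0.0
    for p in range(P):
        for f in range(F):
            resynth = sum(fcp_filter(mix[p, f], est[c, f], taps) for c in range(C))
            mc = mc + (mix[p, f] - resynth).abs().pow(2).mean()
    # Rough stand-in for the intra-source magnitude-scattering term that
    # discourages frequency permutation (the paper's exact definition may differ):
    logmag = torch.log(est.abs() + 1e-8)                       # (C, F, T)
    scatter = (logmag - logmag.mean(dim=1, keepdim=True)).pow(2).mean()
    return mc / (P * F) + gamma * scatter                      # gamma is an arbitrary weight
```

In practice the loops over microphones and frequencies would be vectorized and the DNN, fed one or more microphone mixtures, would be trained by backpropagating through this loss; the sketch only conveys the structure of the constraint.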
Related papers
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER on the CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
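As a rough illustration of the embedding-space augmentation idea in the DASA entry above, the generic sketch below perturbs each speaker's embeddings with noise shaped by within-speaker statistics; DASA's difficulty-aware weighting is not reproduced, and the function and parameter names are hypothetical.

```python
import numpy as np

def augment_embeddings(emb, labels, scale=0.2, rng=None):
    """Generic embedding-space augmentation sketch (not DASA's exact recipe).

    emb:    (N, D) speaker embeddings.
    labels: (N,) integer speaker ids.
    Returns perturbed copies drawn from per-speaker Gaussians, adding sample
    diversity at negligible cost compared to waveform-level augmentation.
    """
    rng = rng or np.random.default_rng(0)
    out = emb.copy()
    for spk in np.unique(labels):
        idx = np.where(labels == spk)[0]
        if len(idx) < 2:
            continue
        # Within-speaker covariance defines plausible perturbation directions.
        cov = np.cov(emb[idx].T) + 1e-6 * np.eye(emb.shape[1])
        out[idx] += scale * rng.multivariate_normal(np.zeros(emb.shape[1]), cov, size=len(idx))
    return out
```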
- Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation [23.758202121043805]
We propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.
Experimental results show that our approach achieves state-of-the-art performance on the large-scale noisy Libri2Mix and Libri3Mix datasets.
arXiv Detail & Related papers (2023-02-22T03:54:50Z)
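The summary above does not spell out the modulation rule; the sketch below uses a common conflict-resolution heuristic (projecting one task gradient off the other when their dot product is negative) purely as an illustrative assumption, not necessarily the paper's rule.

```python
import torch

def modulate_gradients(g_enh, g_sep):
    """Illustrative gradient modulation for joint enhancement/separation.

    g_enh, g_sep: flattened gradients of the two losses w.r.t. shared weights.
    If the gradients conflict (negative dot product), project the enhancement
    gradient onto the plane orthogonal to the separation gradient, so the
    auxiliary task does not undo progress on the main one.
    """
    dot = torch.dot(g_enh, g_sep)
    if dot < 0:
        g_enh = g_enh - dot / (g_sep.norm() ** 2 + 1e-12) * g_sep
    return g_enh + g_sep  # combined update direction
```

In training, `g_enh` and `g_sep` would be obtained by back-propagating the two losses separately (e.g., with `torch.autograd.grad`) before writing the combined direction back into the shared parameters' gradients.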
- Offline Reinforcement Learning at Multiple Frequencies [62.08749079914275]
We study how well offline reinforcement learning algorithms can accommodate data with a mixture of frequencies during training.
We present a simple yet effective solution that enforces consistency in the rate of $Q$-value updates to stabilize learning.
arXiv Detail & Related papers (2022-07-26T17:54:49Z)
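Read literally, "consistency in the rate of $Q$-value updates" can be approximated by balancing how often transitions from each control frequency drive TD updates; the sketch below is that hypothetical reading, not the paper's exact mechanism, and the `freq_hz` field is an assumed dataset attribute.

```python
import random
from collections import defaultdict

def balanced_batches(dataset, batch_size=256):
    """Yield minibatches in which each control frequency contributes equally,
    so Q-values for every frequency are updated at a consistent rate.

    dataset: list of transitions, each a dict with a 'freq_hz' key (assumed).
    """
    by_freq = defaultdict(list)
    for tr in dataset:
        by_freq[tr["freq_hz"]].append(tr)
    per_freq = max(1, batch_size // len(by_freq))
    while True:
        batch = []
        for trs in by_freq.values():
            batch.extend(random.choices(trs, k=per_freq))
        yield batch
```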
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-lasting problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
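Standard PIT takes a hard minimum over all output-label permutations; a soft-minimum relaxation can be sketched with a temperature-controlled log-sum-exp, as below. The temperature value and the per-pair loss are placeholders, not the paper's settings.

```python
import itertools
import torch

def soft_min_pit_loss(est, ref, tau=1.0):
    """est, ref: (C, T) separated outputs and reference sources (C speakers).

    Computes a loss for every output-label permutation and combines them with
    a soft minimum: -tau * logsumexp(-L_perm / tau).
    """
    C = est.shape[0]
    perm_losses = []
    for perm in itertools.permutations(range(C)):
        perm_losses.append(torch.stack(
            [torch.mean((est[i] - ref[j]) ** 2) for i, j in enumerate(perm)]).mean())
    perm_losses = torch.stack(perm_losses)                    # (C!,)
    return -tau * torch.logsumexp(-perm_losses / tau, dim=0)
```

As `tau` approaches zero the soft minimum recovers standard PIT; larger `tau` spreads gradient over near-optimal assignments instead of committing to a single hard one.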
- Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity [15.780905917870427]
We propose a new perspective on batch mixup and formulate the optimal construction of a batch of mixup data.
We also propose an efficient modular approximation based iterative submodular computation algorithm for efficient mixup per each minibatch.
Our experiments show the proposed method achieves state-of-the-art generalization, calibration, and weakly supervised localization results.
arXiv Detail & Related papers (2021-02-05T09:12:02Z)
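For reference, plain input mixup looks like the sketch below; Co-Mixup replaces the random pairing and mixing ratios with a saliency-guided, batch-level submodular optimization, which is not reproduced here.

```python
import numpy as np

def mixup_batch(x, y, alpha=1.0, rng=None):
    """Vanilla mixup baseline: convex combinations of random example pairs.

    x: (B, ...) inputs, y: (B, num_classes) one-hot labels.
    Co-Mixup instead chooses which examples to mix, and with what spatial
    weights, by maximizing saliency coverage and diversity over the batch.
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix
```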
- Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation [79.63545132515188]
We propose multi-microphone complex spectral mapping for speaker separation in reverberant conditions.
Our system is trained on simulated room impulse responses based on a fixed number of microphones arranged in a given geometry.
State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
arXiv Detail & Related papers (2020-10-04T22:13:13Z)
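The core recipe of complex spectral mapping can be sketched as: stack the real and imaginary STFT components of all microphones as input features and regress the real and imaginary components of each speaker at a reference microphone. The toy frame-wise network below is a stand-in for the architecture used in the paper, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class ToyComplexSpectralMapper(nn.Module):
    """Maps multi-channel complex STFTs to per-speaker complex STFTs.

    Input:  (B, P, F, T) complex mixture STFTs from P microphones.
    Output: (B, C, F, T) complex estimates of C speakers at a reference mic.
    """
    def __init__(self, mics=6, speakers=2, freq_bins=257, hidden=512):
        super().__init__()
        self.speakers = speakers
        self.freq_bins = freq_bins
        self.net = nn.Sequential(                     # toy frame-wise MLP
            nn.Linear(2 * mics * freq_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * speakers * freq_bins))

    def forward(self, mix):
        B, P, F, T = mix.shape
        feats = torch.cat([mix.real, mix.imag], dim=1)        # (B, 2P, F, T)
        feats = feats.permute(0, 3, 1, 2).reshape(B, T, -1)   # one vector per frame
        out = self.net(feats).reshape(B, T, 2, self.speakers, self.freq_bins)
        out = out.permute(0, 3, 4, 1, 2)                      # (B, C, F, T, 2)
        return torch.complex(out[..., 0], out[..., 1])
```

Training would minimize a loss on the predicted real and imaginary components, often with an added magnitude term, against the reverberant speaker images at the reference microphone.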
- One Size Fits All: Can We Train One Denoiser for All Noise Levels? [13.46272057205994]
It is often preferred to train one neural network estimator and apply it to all noise levels.
The de facto protocol is to train the estimator with noisy samples whose noise levels are uniformly distributed.
This paper addresses this sampling problem from a minimax risk optimization perspective.
arXiv Detail & Related papers (2020-05-19T17:56:04Z)
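The protocol question reduces to which distribution the noise level is drawn from during training; a minimal sketch is below, with a hypothetical skewed sampler standing in for the minimax-derived distribution.

```python
import numpy as np

def sample_sigma(dist="uniform", lo=0.0, hi=50.0, rng=None):
    """Draw a noise standard deviation for one training sample.

    The de facto protocol draws sigma uniformly from [lo, hi]; the paper asks
    whether a different sampling distribution yields a better worst-case
    (minimax) denoiser.  'skewed' is a hypothetical alternative, not the
    paper's optimized distribution.
    """
    rng = rng or np.random.default_rng()
    if dist == "uniform":
        return rng.uniform(lo, hi)
    return lo + (hi - lo) * rng.beta(2.0, 1.0)   # spends more samples on high noise

def add_noise(clean, sigma, rng=None):
    """Corrupt a clean sample with Gaussian noise of the drawn level."""
    rng = rng or np.random.default_rng()
    return clean + sigma * rng.standard_normal(clean.shape)
```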
- Sparse Mixture of Local Experts for Efficient Speech Enhancement [19.645016575334786]
We investigate a deep learning approach for speech denoising through an efficient ensemble of specialist neural networks.
By splitting up the speech denoising task into non-overlapping subproblems, we are able to improve denoising performance while also reducing computational complexity.
Our findings demonstrate that a fine-tuned ensemble network is able to exceed the speech denoising capabilities of a generalist network.
arXiv Detail & Related papers (2020-05-16T23:23:22Z)
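The ensemble idea can be sketched as a gating module that routes each utterance to a single specialist enhancement network; the routing criterion, expert architecture, and layer sizes below are placeholders rather than the paper's design.

```python
import torch
import torch.nn as nn

class SparseLocalExperts(nn.Module):
    """Route each utterance to one specialist enhancement network.

    Only the selected expert runs, so inference cost stays close to that of a
    single specialist while the ensemble covers non-overlapping subproblems
    (e.g., different SNR ranges -- an assumption of this sketch).
    """
    def __init__(self, num_experts=4, feat_dim=257, hidden=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, num_experts))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, feat_dim), nn.Sigmoid())  # mask estimator
            for _ in range(num_experts))

    def forward(self, spec_mag):                   # (B, T, F) magnitude spectrogram
        logits = self.gate(spec_mag.mean(dim=1))   # utterance-level gating (B, E)
        choice = logits.argmax(dim=-1)             # top-1, i.e. sparse routing
        masks = torch.stack([self.experts[int(c)](spec_mag[b])
                             for b, c in enumerate(choice)])
        return masks * spec_mag                    # enhanced magnitude
```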
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, a deep clustering (DC) network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
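The two-stage pipeline can be sketched as follows: a deep-clustering-style network maps each T-F unit to a unit-norm embedding at the denoising stage, and a second network consumes those embeddings to regress the anechoic speech at the dereverberation stage. Layer sizes and the exact way the embeddings are consumed are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class EmbeddingDenoiser(nn.Module):
    """Stage 1: produce a unit-norm embedding per T-F bin (deep-clustering style)."""
    def __init__(self, freq_bins=257, emb_dim=20, hidden=300):
        super().__init__()
        self.rnn = nn.LSTM(freq_bins, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, freq_bins * emb_dim)
        self.emb_dim = emb_dim

    def forward(self, logmag):                       # (B, T, F) noisy log-magnitude
        h, _ = self.rnn(logmag)
        v = self.proj(h).reshape(*logmag.shape, self.emb_dim)
        return nn.functional.normalize(v, dim=-1)    # (B, T, F, D)

class Dereverberator(nn.Module):
    """Stage 2: estimate anechoic speech from the embeddings, replacing the
    unsupervised K-means clustering step with a learned mapping."""
    def __init__(self, freq_bins=257, emb_dim=20, hidden=300):
        super().__init__()
        self.rnn = nn.LSTM(freq_bins * emb_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, freq_bins)

    def forward(self, emb):                          # (B, T, F, D) embeddings
        B, T, F, D = emb.shape
        h, _ = self.rnn(emb.reshape(B, T, F * D))
        return self.out(h)                           # (B, T, F) anechoic log-magnitude
```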