Unsupervised Sound Separation Using Mixture Invariant Training
- URL: http://arxiv.org/abs/2006.12701v2
- Date: Sat, 24 Oct 2020 02:03:02 GMT
- Title: Unsupervised Sound Separation Using Mixture Invariant Training
- Authors: Scott Wisdom and Efthymios Tzinis and Hakan Erdogan and Ron J. Weiss
and Kevin Wilson and John R. Hershey
- Abstract summary: We show that MixIT can achieve competitive performance compared to supervised methods on speech separation.
In particular, we significantly improve reverberant speech separation performance by incorporating reverberant mixtures.
- Score: 38.0680944898427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, rapid progress has been made on the problem of
single-channel sound separation using supervised training of deep neural
networks. In such supervised approaches, a model is trained to predict the
component sources from synthetic mixtures created by adding up isolated
ground-truth sources. Reliance on this synthetic training data is problematic
because good performance depends upon the degree of match between the training
data and real-world audio, especially in terms of the acoustic conditions and
distribution of sources. The acoustic properties can be challenging to
accurately simulate, and the distribution of sound types may be hard to
replicate. In this paper, we propose a completely unsupervised method, mixture
invariant training (MixIT), that requires only single-channel acoustic
mixtures. In MixIT, training examples are constructed by mixing together
existing mixtures, and the model separates them into a variable number of
latent sources, such that the separated sources can be remixed to approximate
the original mixtures. We show that MixIT can achieve competitive performance
compared to supervised methods on speech separation. Using MixIT in a
semi-supervised learning setting enables unsupervised domain adaptation and
learning from large amounts of real world data without ground-truth source
waveforms. In particular, we significantly improve reverberant speech
separation performance by incorporating reverberant mixtures, train a speech
enhancement system from noisy mixtures, and improve universal sound separation
by incorporating a large amount of in-the-wild data.
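The core idea of the abstract can be sketched as a loss function: the model separates a mixture of mixtures into several estimated sources, and the loss is the best reconstruction error over all ways of re-assigning those sources back to the two original mixtures. The sketch below is a minimal, brute-force NumPy illustration of that mixture invariant criterion (the function name and MSE choice are illustrative assumptions, not the paper's exact implementation, which uses an SNR-based loss and a learned separation network):

```python
# Minimal sketch of a MixIT-style loss, assuming the separation model has
# already produced M estimated sources for the sum of two mixtures.
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """Return the minimum MSE over all binary assignments of estimated
    sources to the two reference mixtures.

    est_sources: (M, T) array of separated source estimates
    mix1, mix2:  (T,) arrays, the two original mixtures
    """
    M = est_sources.shape[0]
    best = np.inf
    # Enumerate every assignment of each source to mixture 1 or mixture 2.
    for assign in itertools.product([0, 1], repeat=M):
        a = np.array(assign, dtype=float)
        remix1 = (est_sources * (1.0 - a)[:, None]).sum(axis=0)
        remix2 = (est_sources * a[:, None]).sum(axis=0)
        err = np.mean((remix1 - mix1) ** 2) + np.mean((remix2 - mix2) ** 2)
        best = min(best, err)
    return best
```

Note that the loss is zero whenever some grouping of the estimated sources reconstructs both mixtures exactly, which is what lets the model learn separation without ever seeing isolated ground-truth sources; in practice the enumeration over 2^M assignments is cheap because M is small.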
Related papers
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study how model performance can be predicted as a function of the mixture proportions of the training data.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- Single-channel speech enhancement using learnable loss mixup
Generalization remains a major problem in supervised learning of single-channel speech enhancement.
We propose learnable loss mixup (LLM), a simple and effortless training scheme, to improve the generalization of deep learning-based speech enhancement models.
Our experimental results on the VCTK benchmark show that learnable loss mixup achieves 3.26 PESQ, outperforming the state-of-the-art.
arXiv Detail & Related papers (2023-12-20T00:25:55Z)
- PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis
This paper introduces PowMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches.
PowMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.
arXiv Detail & Related papers (2023-12-19T17:01:58Z)
- One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls
One More Step (OMS) is a compact network that incorporates an additional simple yet effective step during inference.
OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters.
Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.
arXiv Detail & Related papers (2023-11-27T12:02:42Z)
- Semantic Equivariant Mixup
Mixup is a well-established data augmentation technique, which can extend the training distribution and regularize the neural networks.
Previous mixup variants tend to over-focus on the label-related information.
We propose a semantic equivariant mixup (sem) to preserve richer semantic information in the input.
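The standard mixup augmentation that this line of work builds on can be sketched in a few lines (a generic illustration with assumed one-hot label vectors, not the semantic equivariant variant proposed in the paper):

```python
# Minimal sketch of standard mixup: a convex combination of two training
# examples and their (one-hot) labels, with mixing weight drawn from Beta.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Return a mixed example and mixed label."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing weight in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2       # mixed input
    y = lam * y1 + (1.0 - lam) * y2       # mixed (soft) label
    return x, y
```

The soft label is what regularizes the network; the mixup variants surveyed above differ mainly in *what* is mixed (inputs, embeddings, losses) and how the label-related information is preserved.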
arXiv Detail & Related papers (2023-08-12T03:05:53Z)
- Over-training with Mixup May Hurt Generalization
We report a previously unobserved phenomenon in Mixup training.
On a number of standard datasets, the performance of Mixup-trained models starts to decay after training for a large number of epochs.
We show theoretically that Mixup training may introduce undesired data-dependent label noises to the synthesized data.
arXiv Detail & Related papers (2023-03-02T18:37:34Z)
- Unsupervised Source Separation via Self-Supervised Training
We introduce two novel unsupervised (blind) source separation methods, which involve self-supervised training from single-channel two-source speech mixtures.
Our first method employs permutation invariant training (PIT) to separate artificially-generated mixtures back into the original mixtures.
We improve upon this first method by creating mixtures of source estimates and employing PIT to separate these new mixtures in a cyclic fashion.
We show that MixPIT outperforms a common baseline (MixIT) on our small dataset (SC09Mix), and that the two have comparable performance on a standard dataset (LibriMix).
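The permutation invariant training (PIT) criterion mentioned above can be sketched as follows: the loss is evaluated under the best matching between estimated and reference sources, so the network is not penalized for emitting sources in an arbitrary order (a brute-force MSE illustration with assumed names, not the papers' exact implementation):

```python
# Minimal sketch of a PIT-style loss: MSE under the best permutation
# matching estimated sources to reference sources.
import itertools
import numpy as np

def pit_loss(est_sources, ref_sources):
    """est_sources, ref_sources: (N, T) arrays of N sources."""
    N = est_sources.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(N)):
        # Reorder the estimates and measure reconstruction error.
        err = np.mean((est_sources[list(perm)] - ref_sources) ** 2)
        best = min(best, err)
    return best
```

MixIT can be seen as a generalization of this idea from permutations of sources to assignments of sources into mixtures, which is what removes the need for isolated references.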
arXiv Detail & Related papers (2022-02-08T14:02:50Z)
- Unsupervised Audio Source Separation Using Differentiable Parametric Source Models
We propose an unsupervised model-based deep learning approach to musical source separation.
A neural network is trained to reconstruct the observed mixture as a sum of the sources.
The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms learning-free methods.
arXiv Detail & Related papers (2022-01-24T11:05:30Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After predicting the discrete symbol sequence, each target speech can be re-synthesized by feeding the symbols to a synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation
We introduce a novel semi-supervised learning framework for end-to-end speech separation.
The proposed method first uses mixtures of unseparated sources and the mixture invariant training criterion to train a teacher model.
Experiments with single- and multi-channel mixtures show that the teacher-student training resolves the over-separation problem.
arXiv Detail & Related papers (2021-06-15T02:26:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.