On permutation invariant training for speech source separation
- URL: http://arxiv.org/abs/2102.04945v1
- Date: Tue, 9 Feb 2021 16:57:32 GMT
- Title: On permutation invariant training for speech source separation
- Authors: Xiaoyu Liu and Jordi Pons
- Abstract summary: We study permutation invariant training (PIT), which addresses the permutation ambiguity problem in speaker-independent source separation models.
First, we look at the two-stage speaker separation and tracking algorithm based on frame-level PIT (tPIT) and clustering, which was originally proposed for the STFT domain.
Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem agnostic speech features", to reduce the local permutation errors made by the utterance-level PIT (uPIT).
- Score: 20.82852423999727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study permutation invariant training (PIT), which addresses the
permutation ambiguity problem in speaker-independent source separation models.
We extend two state-of-the-art PIT strategies. First, we look at the two-stage
speaker separation and tracking algorithm based on frame-level PIT (tPIT) and
clustering, which was originally proposed for the STFT domain, and we adapt it
to work with waveforms and over a learned latent space. Further, we propose an
efficient clustering loss scalable to waveform models. Second, we extend a
recently proposed auxiliary speaker-ID loss with a deep feature loss based on
"problem agnostic speech features", to reduce the local permutation errors made
by the utterance-level PIT (uPIT). Our results show that the proposed
extensions help reduce permutation ambiguity. However, we also note that the
studied STFT-based models are more effective at reducing permutation errors
than waveform-based models, a perspective overlooked in recent studies.
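For context, here is a minimal PyTorch-style sketch of the two PIT variants discussed in the abstract, assuming a generic per-source loss (a simple MSE stands in for the negative SI-SNR commonly used); the tensor shapes and helper names are illustrative assumptions, not taken from the paper.

```python
import itertools
import torch

def pairwise_loss(est, ref):
    # Per-source loss; a simple MSE placeholder averaged over the last axis.
    # est, ref: (batch, time) for uPIT or (batch, frames, bins) for tPIT.
    return ((est - ref) ** 2).mean(dim=-1)

def upit_loss(est, ref):
    """Utterance-level PIT: one permutation is chosen per whole utterance."""
    # est, ref: (batch, n_src, time)
    n_src = est.shape[1]
    perm_losses = []
    for perm in itertools.permutations(range(n_src)):
        # Average the per-source losses under this speaker assignment.
        loss = torch.stack(
            [pairwise_loss(est[:, i], ref[:, p]) for i, p in enumerate(perm)]
        ).mean(dim=0)                       # (batch,)
        perm_losses.append(loss)
    # Keep, for each utterance, the permutation with the smallest loss.
    return torch.stack(perm_losses, dim=1).min(dim=1).values.mean()

def tpit_loss(est, ref):
    """Frame-level PIT: the best permutation is chosen independently per frame."""
    # est, ref: (batch, n_src, frames, bins)
    n_src = est.shape[1]
    perm_losses = []
    for perm in itertools.permutations(range(n_src)):
        loss = torch.stack(
            [pairwise_loss(est[:, i], ref[:, p]) for i, p in enumerate(perm)]
        ).mean(dim=0)                       # (batch, frames)
        perm_losses.append(loss)
    # Minimum over permutations at every frame, then average.
    return torch.stack(perm_losses, dim=1).min(dim=1).values.mean()
```

Frame-level tPIT separates well locally but leaves the speaker assignment unresolved across frames, which is why the paper pairs it with a clustering stage; uPIT fixes one assignment per utterance, and the auxiliary speaker-ID and deep feature losses target its remaining local permutation errors.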
Related papers
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- PTP: Boosting Stability and Performance of Prompt Tuning with Perturbation-Based Regularizer [94.23904400441957]
We introduce perturbation-based regularizers, which can smooth the loss landscape, into prompt tuning.
We design two kinds of perturbation-based regularizers, including random-noise-based and adversarial-based.
Our new algorithms improve the state-of-the-art prompt tuning methods by 1.94% and 2.34% on SuperGLUE and FewGLUE benchmarks, respectively.
arXiv Detail & Related papers (2023-05-03T20:30:51Z)
- Adversarial Permutation Invariant Training for Universal Sound Separation [23.262892768718824]
In this work, we complement permutation invariant training (PIT) with adversarial losses but find it challenging with the standard formulation used in speech source separation.
We overcome this challenge with a novel I-replacement context-based adversarial loss, and by training with multiple discriminators.
Our experiments show that by simply improving the loss (keeping the same model and dataset) we obtain a non-negligible improvement of 1.4 dB SI-SNRi on the reverberant FUSS dataset.
arXiv Detail & Related papers (2022-10-21T17:04:17Z)
- DisC-VC: Disentangled and F0-Controllable Neural Voice Conversion [17.83563578034567]
We propose a new variational-autoencoder-based voice conversion model accompanied by an auxiliary network.
We show the effectiveness of the proposed method by objective and subjective evaluations.
arXiv Detail & Related papers (2022-10-20T07:30:07Z)
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment (a sketch of such a soft assignment appears after this list).
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Deep Semantic Matching with Foreground Detection and Cycle-Consistency [103.22976097225457]
We address weakly supervised semantic matching based on a deep network.
We explicitly estimate the foreground regions to suppress the effect of background clutter.
We develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent.
arXiv Detail & Related papers (2020-03-31T22:38:09Z)
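As a side note on the Single-channel speech separation using Soft-minimum Permutation Invariant Training entry above, here is a hedged sketch of one way to relax PIT's hard minimum over permutations into a differentiable soft assignment; the temperature parameter, tensor shapes, and function name are assumptions for illustration, not the authors' exact formulation.

```python
import itertools
import torch

def soft_min_pit_loss(est, ref, tau=0.1):
    """Softmax-weighted combination of permutation losses instead of a hard
    argmin (illustrative sketch only; not the paper's exact objective)."""
    # est, ref: (batch, n_src, time)
    n_src = est.shape[1]
    perm_losses = []
    for perm in itertools.permutations(range(n_src)):
        # Per-source MSE under this assignment, averaged over sources.
        loss = torch.stack(
            [((est[:, i] - ref[:, p]) ** 2).mean(dim=-1) for i, p in enumerate(perm)]
        ).mean(dim=0)                                    # (batch,)
        perm_losses.append(loss)
    L = torch.stack(perm_losses, dim=1)                  # (batch, n_perms)
    # A low temperature tau approaches the hard minimum used by standard PIT.
    weights = torch.softmax(-L / tau, dim=1)
    return (weights * L).sum(dim=1).mean()
```

As tau goes to zero the weights concentrate on the best permutation, recovering the hard minimum of standard uPIT; a larger tau spreads gradient across near-optimal assignments.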
This list is automatically generated from the titles and abstracts of the papers on this site.