Single-channel speech separation using Soft-minimum Permutation
Invariant Training
- URL: http://arxiv.org/abs/2111.08635v1
- Date: Tue, 16 Nov 2021 17:25:05 GMT
- Title: Single-channel speech separation using Soft-minimum Permutation
Invariant Training
- Authors: Midia Yousefi, John H.L. Hansen
- Abstract summary: A long-lasting problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
- Score: 60.99112031408449
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of speech separation is to extract multiple speech sources from a
single microphone recording. Recently, with the advancement of deep learning
and availability of large datasets, speech separation has been formulated as a
supervised learning problem. These approaches aim to learn discriminative
patterns of speech, speakers, and background noise using a supervised learning
algorithm, typically a deep neural network. A long-lasting problem in
supervised speech separation is finding the correct label for each separated
speech signal, referred to as label permutation ambiguity. Permutation
ambiguity refers to the problem of determining the output-label assignment
between the separated sources and the available single-speaker speech labels.
Finding the best output-label assignment is required for calculation of
separation error, which is later used for updating parameters of the model.
Recently, Permutation Invariant Training (PIT) has been shown to be a promising
solution in handling the label ambiguity problem. However, the overconfident
choice of the output-label assignment by PIT results in a sub-optimally trained
model. In this work, we propose a probabilistic optimization framework to
address the inefficiency of PIT in finding the best output-label assignment.
Our proposed method, termed trainable Soft-minimum PIT, is then employed on the
same Long Short-Term Memory (LSTM) architecture used in the Permutation
Invariant Training (PIT) speech separation method. The results of our experiments show
that the proposed method outperforms conventional PIT speech separation
significantly (p-value $ < 0.01$) by +1dB in Signal to Distortion Ratio (SDR)
and +1.5dB in Signal to Interference Ratio (SIR).
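The abstract describes the hard assignment in PIT and its soft-minimum relaxation only at a high level. The following is a minimal PyTorch-style sketch, not the authors' implementation: it shows how a conventional PIT loss (hard minimum over permutations) and a softmin-weighted variant could be computed. The helper names, the MSE criterion, and the fixed temperature `tau` are assumptions; since the abstract calls the soft minimum "trainable", one way to realize that would be to make the temperature a learnable parameter.

```python
import itertools

import torch
import torch.nn.functional as F


def permutation_losses(est, ref):
    """Per-assignment separation errors.

    est, ref: (batch, n_spk, time) estimated and reference sources.
    Returns a (batch, n_perms) tensor of MSE losses, one per permutation.
    """
    n_spk = est.shape[1]
    losses = []
    for perm in itertools.permutations(range(n_spk)):
        err = F.mse_loss(est[:, list(perm), :], ref, reduction="none")
        losses.append(err.mean(dim=(1, 2)))
    return torch.stack(losses, dim=1)


def hard_pit_loss(est, ref):
    # Conventional PIT: commit to the single best output-label assignment.
    return permutation_losses(est, ref).min(dim=1).values.mean()


def soft_min_pit_loss(est, ref, tau=1.0):
    # Soft-minimum variant: weight every assignment by a softmin over its
    # loss instead of committing to one hard choice; tau -> 0 recovers PIT.
    losses = permutation_losses(est, ref)       # (batch, n_perms)
    weights = F.softmax(-losses / tau, dim=1)   # softmin weighting
    return (weights * losses).sum(dim=1).mean()
```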
Related papers
- Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition [18.50957174600796]
One solution to automatic speech recognition (ASR) of overlapping speakers is to first separate the speech and then perform ASR on the separated signals.
Currently, the separator produces artefacts which often degrade ASR performance.
This paper proposes a transcription-free method for joint training using only audio signals.
arXiv Detail & Related papers (2024-06-13T08:20:58Z)
- Improving Label Assignments Learning by Dynamic Sample Dropout Combined with Layer-wise Optimization in Speech Separation [8.489574755691613]
In supervised speech separation, permutation invariant training (PIT) is widely used to handle label ambiguity by selecting the best permutation to update the model.
Previous studies showed that PIT is plagued by excessive label assignment switching in adjacent epochs, which impedes the model from learning better label assignments.
We propose a novel training strategy, dynamic sample dropout (DSD), which considers previous best label assignments and evaluation metrics to exclude the samples that may negatively impact the learned label assignments during training.
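The one-line summary above leaves the dropout criterion implicit. Purely as a hypothetical illustration of how such a rule might look (the function and variable names below are invented and are not from that paper), a sample could be excluded from the loss whenever its best permutation switched since the previous epoch while its evaluation metric degraded.

```python
import torch


def dynamic_sample_dropout_mask(curr_perm: torch.Tensor, prev_perm: torch.Tensor,
                                curr_metric: torch.Tensor, prev_metric: torch.Tensor) -> torch.Tensor:
    """Hypothetical per-sample keep/drop mask.

    curr_perm, prev_perm: (batch,) indices of the best permutation per sample
        in the current and previous epoch.
    curr_metric, prev_metric: (batch,) per-sample scores (higher is better).
    Returns a (batch,) float mask: 0 drops samples whose label assignment
    switched while their metric got worse, 1 keeps the rest.
    """
    switched = curr_perm != prev_perm
    degraded = curr_metric < prev_metric
    return (~(switched & degraded)).float()


# Usage sketch: weight per-sample PIT losses by the mask before averaging.
# mask = dynamic_sample_dropout_mask(curr_perm, prev_perm, curr_score, prev_score)
# loss = (mask * per_sample_loss).sum() / mask.sum().clamp(min=1.0)
```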
arXiv Detail & Related papers (2023-11-20T21:37:38Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- On Robust Learning from Noisy Labels: A Permutation Layer Approach [53.798757734297986]
This paper introduces a permutation layer learning approach termed PermLL to dynamically calibrate the training process of a deep neural network (DNN).
We provide two variants of PermLL in this paper: one applies the permutation layer to the model's prediction, while the other applies it directly to the given noisy label.
We validate PermLL experimentally and show that it achieves state-of-the-art performance on both real and synthetic datasets.
arXiv Detail & Related papers (2022-11-29T03:01:48Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- On permutation invariant training for speech source separation [20.82852423999727]
We study permutation invariant training (PIT), which targets the permutation ambiguity problem for speaker-independent source separation models.
First, we look at the two-stage speaker separation and tracking algorithm based on frame level PIT (tPIT) and clustering, which was originally proposed for the STFT domain.
Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem-agnostic speech features" to reduce the local permutation errors made by utterance-level PIT (uPIT).
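As a side note on the tPIT/uPIT distinction mentioned above, the sketch below (under the same illustrative assumptions as the earlier block, not the formulation used in that paper) contrasts the two: utterance-level PIT picks one assignment per utterance, whereas frame-level PIT re-selects the assignment at every frame.

```python
import itertools

import torch
import torch.nn.functional as F


def _per_perm_errors(est, ref, reduce_dims):
    # est, ref: (batch, n_spk, frames). Returns per-permutation errors,
    # stacked along dim 1, reduced over the given dimensions.
    perms = itertools.permutations(range(est.shape[1]))
    errs = [F.mse_loss(est[:, list(p), :], ref, reduction="none").mean(dim=reduce_dims)
            for p in perms]
    return torch.stack(errs, dim=1)


def upit_loss(est, ref):
    # Utterance-level PIT: one output-label assignment per utterance.
    return _per_perm_errors(est, ref, reduce_dims=(1, 2)).min(dim=1).values.mean()


def tpit_loss(est, ref):
    # Frame-level PIT: the assignment may switch from frame to frame, so the
    # minimum is taken independently at each frame before averaging.
    return _per_perm_errors(est, ref, reduce_dims=(1,)).min(dim=1).values.mean()
```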
arXiv Detail & Related papers (2021-02-09T16:57:32Z)
- Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.