Many-Speakers Single Channel Speech Separation with Optimal Permutation
Training
- URL: http://arxiv.org/abs/2104.08955v1
- Date: Sun, 18 Apr 2021 20:56:12 GMT
- Title: Many-Speakers Single Channel Speech Separation with Optimal Permutation
Training
- Authors: Shaked Dovrat, Eliya Nachmani, Lior Wolf
- Abstract summary: We present a permutation invariant training that employs the Hungarian algorithm in order to train with an $O(C^3)$ time complexity.
Our approach separates up to $20$ speakers and improves the previous results for large $C$ by a wide margin.
- Score: 91.22679787578438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Single channel speech separation has experienced great progress in the last
few years. However, training neural speech separation for a large number of
speakers (e.g., more than 10 speakers) is out of reach for the current methods,
which rely on the Permutation Invariant Loss (PIT). In this work, we present a
permutation invariant training that employs the Hungarian algorithm in order to
train with an $O(C^3)$ time complexity, where $C$ is the number of speakers, in
comparison to $O(C!)$ of PIT based methods. Furthermore, we present a modified
architecture that can handle the increased number of speakers. Our approach
separates up to $20$ speakers and improves the previous results for large $C$
by a wide margin.
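The key idea in the abstract is that finding the best estimate-to-target assignment is a linear sum assignment problem, solvable by the Hungarian algorithm in $O(C^3)$ rather than enumerating all $C!$ permutations. A minimal sketch of this idea, using mean-squared error as the pairwise cost for brevity (the paper itself trains with an SI-SNR-based objective):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm, O(C^3)

def hungarian_pit_loss(estimates, targets):
    """Permutation-invariant loss via optimal assignment.

    estimates, targets: arrays of shape (C, T) -- C speakers, T samples.
    Returns the loss under the best permutation and the assignment itself.
    """
    # Pairwise cost matrix: cost[i, j] = MSE(estimate_i, target_j)
    cost = np.mean(
        (estimates[:, None, :] - targets[None, :, :]) ** 2, axis=-1
    )
    # Optimal assignment in O(C^3) instead of scanning all C! permutations
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean(), cols

# Toy check: shuffled targets are still matched back to their estimates
rng = np.random.default_rng(0)
est = rng.standard_normal((4, 100))
perm = np.array([2, 0, 3, 1])
loss, assignment = hungarian_pit_loss(est, est[perm])
```

For $C = 20$ speakers, exhaustive PIT would evaluate $20! \approx 2.4 \times 10^{18}$ permutations, while the assignment above stays cubic in $C$.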
Related papers
- Multiple Choice Learning for Efficient Speech Separation with Many Speakers [14.259149632246555]
Training speech separation models in the supervised setting raises a permutation problem.
We consider using the Multiple Choice Learning framework, which was originally introduced to tackle ambiguous tasks.
arXiv Detail & Related papers (2024-11-27T16:38:34Z)
- Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens [27.08293218877395]
We propose Sortformer, a novel neural model for speaker diarization.
We train Sortformer with unconventional objectives compared to existing end-to-end diarization models.
Code and trained models will be made publicly available via the NVIDIA NeMo framework.
arXiv Detail & Related papers (2024-09-10T17:20:11Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- SepIt: Approaching a Single Channel Speech Separation Bound [99.19786288094596]
We introduce a deep neural network, SepIt, that iteratively refines the estimates of the different speakers.
In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
arXiv Detail & Related papers (2022-05-24T05:40:36Z)
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-lasting problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
- Guided Training: A Simple Method for Single-channel Speaker Separation [40.34570426165019]
We propose a strategy to train a long short-term memory (LSTM) model to solve the permutation problem in speaker separation.
Thanks to its powerful sequence-modeling capability, the LSTM can use its memory cells to track and separate target speech from interfering speech.
arXiv Detail & Related papers (2021-03-26T08:46:50Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Towards Listening to 10 People Simultaneously: An Efficient Permutation Invariant Training of Audio Source Separation Using Sinkhorn's Algorithm [9.340611077939828]
In neural network-based monaural speech separation techniques, it has been recently common to evaluate the loss using the permutation invariant training (PIT) loss.
This paper proposes a SinkPIT, a novel variant of the PIT losses, which is much more efficient than the ordinary PIT loss when $N$ is large.
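SinkPIT replaces the hard permutation search with Sinkhorn iterations, which produce a doubly stochastic relaxation of the permutation matrix. An illustrative sketch of that normalization (not the authors' code; `beta` and the iteration count are hypothetical choices):

```python
import numpy as np

def sinkhorn(cost, beta=10.0, n_iters=50):
    """Soft relaxation of the optimal-permutation search.

    Starting from a softmax-like transform of the negative cost,
    alternately normalize rows and columns; the result approaches a
    doubly stochastic matrix that concentrates on the best permutation
    as beta grows.
    """
    P = np.exp(-beta * cost)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)  # row normalization
        P /= P.sum(axis=0, keepdims=True)  # column normalization
    return P

# Toy cost matrix whose optimal matching is 0->0, 1->2, 2->1
cost = np.array([[0.0, 1.0, 1.0],
                 [1.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0]])
P = sinkhorn(cost)
```

Each Sinkhorn step is $O(N^2)$, which is why the loss stays cheap even when the number of sources $N$ is large.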
arXiv Detail & Related papers (2020-10-22T17:08:17Z)
- Serialized Output Training for End-to-End Overlapped Speech Recognition [35.894025054676696]
Serialized output training (SOT) is a novel framework for multi-speaker overlapped speech recognition.
SOT uses a model with only one output layer that generates the transcriptions of multiple speakers one after another.
We show that the SOT models can transcribe overlapped speech with variable numbers of speakers significantly better than PIT-based models.
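The "one after another" serialization in SOT can be pictured as concatenating the per-speaker transcriptions in order of their start times, separated by a speaker-change token. A sketch under that assumption (the `<sc>` token name is hypothetical):

```python
def serialize_targets(utterances, sc_token="<sc>"):
    """Build a single SOT reference from per-speaker transcriptions.

    utterances: list of (start_time, text) pairs, one per speaker.
    One output layer can then emit all speakers' words sequentially,
    avoiding the permutation problem of per-speaker output layers.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {sc_token} ".join(text for _, text in ordered)

ref = serialize_targets([(1.2, "how are you"), (0.4, "hello there")])
# -> "hello there <sc> how are you"
```

Because the reference order is fixed by start time, the number of speakers can vary per utterance without changing the model's output layer.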
arXiv Detail & Related papers (2020-03-28T02:37:09Z)
- Voice Separation with an Unknown Number of Multiple Speakers [113.91855071999298]
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously.
The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed.
arXiv Detail & Related papers (2020-02-29T20:02:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.