Many-Speakers Single Channel Speech Separation with Optimal Permutation
Training
- URL: http://arxiv.org/abs/2104.08955v1
- Date: Sun, 18 Apr 2021 20:56:12 GMT
- Title: Many-Speakers Single Channel Speech Separation with Optimal Permutation
Training
- Authors: Shaked Dovrat, Eliya Nachmani, Lior Wolf
- Abstract summary: We present a permutation invariant training that employs the Hungarian algorithm in order to train with an $O(C^3)$ time complexity.
Our approach separates up to $20$ speakers and improves the previous results for large $C$ by a wide margin.
- Score: 91.22679787578438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Single channel speech separation has experienced great progress in the last
few years. However, training neural speech separation for a large number of
speakers (e.g., more than 10 speakers) is out of reach for the current methods,
which rely on the Permutation Invariant Loss (PIT). In this work, we present a
permutation invariant training that employs the Hungarian algorithm in order to
train with an $O(C^3)$ time complexity, where $C$ is the number of speakers, in
comparison to $O(C!)$ of PIT based methods. Furthermore, we present a modified
architecture that can handle the increased number of speakers. Our approach
separates up to $20$ speakers and improves the previous results for large $C$
by a wide margin.
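The key idea in the abstract is that finding the best estimate-to-target assignment is a linear sum assignment problem, solvable by the Hungarian algorithm in $O(C^3)$ rather than enumerating all $C!$ permutations. A minimal sketch of this idea, using mean-squared error as the pairwise cost for brevity (the paper itself trains with an SI-SNR-based objective):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm, O(C^3)

def hungarian_pit_loss(estimates, targets):
    """Permutation-invariant loss via optimal assignment.

    estimates, targets: arrays of shape (C, T) -- C speakers, T samples.
    Returns the loss under the best permutation and the assignment itself.
    """
    # Pairwise cost matrix: cost[i, j] = MSE(estimate_i, target_j)
    cost = np.mean(
        (estimates[:, None, :] - targets[None, :, :]) ** 2, axis=-1
    )
    # Optimal assignment in O(C^3) instead of scanning all C! permutations
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean(), cols

# Toy check: shuffled targets are still matched back to their estimates
rng = np.random.default_rng(0)
est = rng.standard_normal((4, 100))
perm = np.array([2, 0, 3, 1])
loss, assignment = hungarian_pit_loss(est, est[perm])
```

For $C = 20$ speakers, exhaustive PIT would evaluate $20! \approx 2.4 \times 10^{18}$ permutations, while the assignment above stays cubic in $C$.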
Related papers
- Multiple Choice Learning for Efficient Speech Separation with Many Speakers [14.259149632246555]
Training speech separation models in the supervised setting raises a permutation problem.
We consider using the Multiple Choice Learning framework, which was originally introduced to tackle ambiguous tasks.
arXiv Detail & Related papers (2024-11-27T16:38:34Z)
- Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens [27.08293218877395]
We propose Sortformer, a novel neural model for speaker diarization.
We train Sortformer with unconventional objectives compared to existing end-to-end diarization models.
Code and trained models will be made publicly available via the NVIDIA NeMo framework.
arXiv Detail & Related papers (2024-09-10T17:20:11Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- SepIt: Approaching a Single Channel Speech Separation Bound [99.19786288094596]
We introduce a deep neural network, SepIt, that iteratively refines the estimates of the different speakers.
In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
arXiv Detail & Related papers (2022-05-24T05:40:36Z)
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-lasting problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
- Guided Training: A Simple Method for Single-channel Speaker Separation [40.34570426165019]
We propose a strategy to train a long short-term memory (LSTM) model to solve the permutation problem in speaker separation.
Thanks to its powerful sequence-modeling capability, the LSTM can use its memory cells to track and separate target speech from interfering speech.
arXiv Detail & Related papers (2021-03-26T08:46:50Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Towards Listening to 10 People Simultaneously: An Efficient Permutation Invariant Training of Audio Source Separation Using Sinkhorn's Algorithm [9.340611077939828]
In neural network-based monaural speech separation techniques, it has been recently common to evaluate the loss using the permutation invariant training (PIT) loss.
This paper proposes a SinkPIT, a novel variant of the PIT losses, which is much more efficient than the ordinary PIT loss when $N$ is large.
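SinkPIT replaces the hard permutation search with Sinkhorn iterations, which produce a doubly stochastic relaxation of the permutation matrix. An illustrative sketch of that normalization (not the authors' code; `beta` and the iteration count are hypothetical choices):

```python
import numpy as np

def sinkhorn(cost, beta=10.0, n_iters=50):
    """Soft relaxation of the optimal-permutation search.

    Starting from a softmax-like transform of the negative cost,
    alternately normalize rows and columns; the result approaches a
    doubly stochastic matrix that concentrates on the best permutation
    as beta grows.
    """
    P = np.exp(-beta * cost)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)  # row normalization
        P /= P.sum(axis=0, keepdims=True)  # column normalization
    return P

# Toy cost matrix whose optimal matching is 0->0, 1->2, 2->1
cost = np.array([[0.0, 1.0, 1.0],
                 [1.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0]])
P = sinkhorn(cost)
```

Each Sinkhorn step is $O(N^2)$, which is why the loss stays cheap even when the number of sources $N$ is large.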
arXiv Detail & Related papers (2020-10-22T17:08:17Z)
- Serialized Output Training for End-to-End Overlapped Speech Recognition [35.894025054676696]
Serialized output training (SOT) is a novel framework for multi-speaker overlapped speech recognition.
SOT uses a model with only one output layer that generates the transcriptions of multiple speakers one after another.
We show that the SOT models can transcribe overlapped speech with variable numbers of speakers significantly better than PIT-based models.
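The "one after another" serialization in SOT can be pictured as concatenating the per-speaker transcriptions in order of their start times, separated by a speaker-change token. A sketch under that assumption (the `<sc>` token name is hypothetical):

```python
def serialize_targets(utterances, sc_token="<sc>"):
    """Build a single SOT reference from per-speaker transcriptions.

    utterances: list of (start_time, text) pairs, one per speaker.
    One output layer can then emit all speakers' words sequentially,
    avoiding the permutation problem of per-speaker output layers.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {sc_token} ".join(text for _, text in ordered)

ref = serialize_targets([(1.2, "how are you"), (0.4, "hello there")])
# -> "hello there <sc> how are you"
```

Because the reference order is fixed by start time, the number of speakers can vary per utterance without changing the model's output layer.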
arXiv Detail & Related papers (2020-03-28T02:37:09Z)
- Voice Separation with an Unknown Number of Multiple Speakers [113.91855071999298]
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously.
The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed.
arXiv Detail & Related papers (2020-02-29T20:02:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.