Voice Separation with an Unknown Number of Multiple Speakers
- URL: http://arxiv.org/abs/2003.01531v4
- Date: Tue, 1 Sep 2020 14:12:16 GMT
- Title: Voice Separation with an Unknown Number of Multiple Speakers
- Authors: Eliya Nachmani, Yossi Adi, Lior Wolf
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new method for separating a mixed audio sequence, in which
multiple voices speak simultaneously. The new method employs gated neural
networks that are trained to separate the voices at multiple processing steps,
while maintaining the speaker in each output channel fixed. A different model
is trained for every number of possible speakers, and the model with the
largest number of speakers is employed to select the actual number of speakers
in a given sample. Our method greatly outperforms the current state of the art,
which, as we show, is not competitive for more than two speakers.
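The selection rule described above (run the model trained for the largest number of speakers, then count how many output channels actually carry speech) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the energy-based silence criterion, and the `-20` dB threshold are all assumptions for the example.

```python
import numpy as np

def estimate_num_speakers(separated, silence_db=-20.0):
    """Count active channels in the output of the largest-capacity model.

    separated: array of shape (C, T) holding the C separated channels.
    A channel whose energy falls more than `silence_db` dB below the
    loudest channel is treated as silent (an assumed criterion).
    """
    # Per-channel energy (mean squared amplitude).
    energy = np.mean(separated ** 2, axis=1)
    # Threshold relative to the loudest channel, converted from dB.
    threshold = energy.max() * 10 ** (silence_db / 10)
    return int(np.sum(energy >= threshold))

# Toy example: a 5-channel model output where only 2 channels carry signal.
rng = np.random.default_rng(0)
out = np.vstack([
    rng.normal(0, 1.0, 16000),   # active "speaker" 1
    rng.normal(0, 0.8, 16000),   # active "speaker" 2
    rng.normal(0, 1e-3, 16000),  # near-silent channel
    rng.normal(0, 1e-3, 16000),  # near-silent channel
    rng.normal(0, 1e-3, 16000),  # near-silent channel
])
print(estimate_num_speakers(out))  # 2
```

In this sketch the silent channels sit roughly 60 dB below the active ones, so any reasonable threshold separates them; a real system would need a criterion robust to low-volume speakers and residual bleed between channels.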
Related papers
- TOGGL: Transcribing Overlapping Speech with Staggered Labeling [5.088540556965433]
We propose a model to simultaneously transcribe the speech of multiple speakers.
Our approach generalizes beyond two speakers, even when trained only on two-speaker data.
arXiv Detail & Related papers (2024-08-12T20:19:27Z)
- End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation [23.895122319920997]
We tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model.
Speaker-Turn Aware Conversational Speech Translation combines automatic speech recognition, speech translation and speaker turn detection.
We show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition.
arXiv Detail & Related papers (2023-11-01T17:55:09Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- Single channel voice separation for unknown number of speakers under reverberant and noisy settings [106.48335929548875]
We present a unified network for voice separation of an unknown number of speakers.
The proposed approach is composed of several separation heads optimized together with a speaker classification branch.
We present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
arXiv Detail & Related papers (2020-11-04T14:59:14Z)
- Neural Speaker Diarization with Speaker-Wise Chain Rule [45.60980782843576]
We propose a speaker-wise conditional inference method for speaker diarization.
We show that the proposed method can correctly produce diarization results with a variable number of speakers.
arXiv Detail & Related papers (2020-06-02T17:28:12Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.