Guided Training: A Simple Method for Single-channel Speaker Separation
- URL: http://arxiv.org/abs/2103.14330v1
- Date: Fri, 26 Mar 2021 08:46:50 GMT
- Title: Guided Training: A Simple Method for Single-channel Speaker Separation
- Authors: Hao Li, Xueliang Zhang, Guanglai Gao
- Abstract summary: We propose a strategy to train a long short-term memory (LSTM) model to solve the permutation problem in speaker separation.
Owing to its powerful sequence-modeling capability, the LSTM can use its memory cells to track and separate the target speech from the interfering speech.
- Score: 40.34570426165019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning has shown great potential for speech separation, especially for separating speech from non-speech. However, it encounters the permutation problem in multi-speaker separation, where both the target and the interference are speech. Permutation invariant training (PIT) was proposed to solve this problem by permuting the order of the multiple speakers. Another approach is to use an anchor speech, a short utterance of the target speaker, to model the speaker identity. In this paper, we propose a simple strategy for training a long short-term memory (LSTM) model to solve the permutation problem in speaker separation. Specifically, we insert a short utterance of the target speaker at the beginning of a mixture as guide information, so the first speaker to appear is defined as the target. Owing to its powerful sequence-modeling capability, the LSTM can use its memory cells to track and separate the target speech from the interfering speech. Experimental results show that the proposed training strategy is effective for speaker separation.
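As a rough illustration of the guided-training idea, here is a minimal sketch that prepends a short guide utterance of the target speaker to the mixture and trains an LSTM mask estimator. The spectrogram features, layer sizes, guide length, and MSE loss are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of guided training: the guide utterance is concatenated in
# front of the mixture, so the first speaker the LSTM hears is the target.
import torch
import torch.nn as nn

class GuidedLSTMSeparator(nn.Module):
    def __init__(self, n_bins=257, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=layers, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, feats):          # feats: (batch, frames, n_bins)
        h, _ = self.lstm(feats)
        return self.mask(h)            # T-F mask for the target speaker

def make_guided_input(guide, mixture):
    """Prepend a short utterance of the target speaker to the mixture."""
    return torch.cat([guide, mixture], dim=1)   # concatenate along time

# Toy usage with random "spectrograms": 50 guide frames + 200 mixture frames.
guide = torch.rand(1, 50, 257)
mixture = torch.rand(1, 200, 257)
target = torch.rand(1, 200, 257)       # clean target spectrogram (label)

model = GuidedLSTMSeparator()
mask = model(make_guided_input(guide, mixture))
est = mask[:, 50:, :] * mixture        # score only the mixture frames
loss = nn.functional.mse_loss(est, target)
loss.backward()
```

Because the guide always comes first, the network never faces an ambiguous speaker ordering, which is how the strategy sidesteps the permutation problem without enumerating label assignments.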
Related papers
- Monaural Multi-Speaker Speech Separation Using Efficient Transformer Model [0.0]
"Monaural multi-speaker speech separation" presents a speech-separation model based on the Transformer architecture and its efficient forms.
The model has been trained with the LibriMix dataset containing diverse speakers' utterances.
arXiv Detail & Related papers (2023-07-29T15:10:46Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
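A minimal sketch of the random-masking step described above, in NumPy; the mask ratio, span length, and mask token id are assumptions for illustration, not ERNIE-SAT's actual configuration.

```python
# Randomly mask spectrogram spans and phoneme ids for joint pretraining.
import numpy as np

rng = np.random.default_rng(0)

def mask_spectrogram(spec, ratio=0.15, span=5):
    """Zero out random contiguous spans of frames. spec: (frames, bins)."""
    spec = spec.copy()
    n_frames = spec.shape[0]
    n_spans = max(1, int(n_frames * ratio / span))
    for _ in range(n_spans):
        start = rng.integers(0, max(1, n_frames - span))
        spec[start:start + span] = 0.0
    return spec

def mask_phonemes(phonemes, ratio=0.15, mask_id=0):
    """Replace a random subset of phoneme ids with a [MASK] id."""
    phonemes = np.array(phonemes)
    hit = rng.random(len(phonemes)) < ratio
    phonemes[hit] = mask_id
    return phonemes

spec = rng.random((200, 80))            # toy mel-spectrogram
phones = rng.integers(1, 50, size=40)   # toy phoneme id sequence
masked_spec, masked_phones = mask_spectrogram(spec), mask_phonemes(phones)
# The pretraining objective would reconstruct the masked spans and ids.
```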
- Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS [36.023566245506046]
We propose a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech.
The proposed method uses a sequential line search algorithm that repeatedly asks a user to select a point on a line segment in the embedding space.
Experimental results indicate that the proposed method achieves performance comparable to the conventional method in both objective and subjective evaluations.
arXiv Detail & Related papers (2022-06-21T11:08:05Z)
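A minimal sketch of the sequential line search described above; here a utility function stands in for the human listener, and the embedding dimension, segment sampling, and iteration count are assumptions.

```python
# Sequential line search in a speaker-embedding space with a simulated user.
import numpy as np

rng = np.random.default_rng(1)
DIM = 16
ideal = rng.normal(size=DIM)             # the (unknown) preferred voice

def user_pick(points):
    """Stand-in for a human: choose the candidate closest to the ideal."""
    dists = [np.linalg.norm(p - ideal) for p in points]
    return points[int(np.argmin(dists))]

x = rng.normal(size=DIM)                 # current speaker embedding
for step in range(30):
    direction = rng.normal(size=DIM)     # random search direction
    direction /= np.linalg.norm(direction)
    ts = np.linspace(-1.0, 1.0, 11)      # candidate points on the segment
    x = user_pick([x + t * direction for t in ts])

print("distance to preferred voice:", np.linalg.norm(x - ideal))
```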
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-lasting problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
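For reference, a minimal sketch of the standard hard-minimum PIT loss that the soft-minimum variant above relaxes: the training loss is the minimum, over all speaker orderings, of the summed pairwise losses. Tensor shapes and the MSE criterion are illustrative.

```python
# Utterance-level permutation invariant training (PIT) loss.
from itertools import permutations
import torch

def pit_mse(estimates, references):
    """estimates, references: (batch, n_spk, samples)."""
    n_spk = estimates.shape[1]
    losses = []
    for perm in permutations(range(n_spk)):
        perm_ref = references[:, list(perm), :]
        losses.append(((estimates - perm_ref) ** 2).mean(dim=(1, 2)))
    # Take the best assignment per utterance, then average over the batch.
    return torch.stack(losses, dim=0).min(dim=0).values.mean()

est = torch.rand(4, 2, 16000, requires_grad=True)
ref = torch.rand(4, 2, 16000)
loss = pit_mse(est, ref)
loss.backward()
```

The soft-minimum variant replaces the hard `min` with a differentiable soft minimum, which is the inefficiency the paper above targets.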
- Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals [113.78060608441348]
We propose a speaker extraction technique that operates in multiple stages to take full advantage of a short reference speech sample.
For the first time, we use frame-level sequential speech embeddings as the reference for the target speaker.
arXiv Detail & Related papers (2020-11-19T03:08:04Z)
- Speaker Separation Using Speaker Inventories and Estimated Speech [78.57067876891253]
We propose speaker separation using speaker inventories (SSUSI) and speaker separation using estimated speech (SSUES).
By combining the advantages of permutation invariant training (PIT) and speech extraction, SSUSI significantly outperforms conventional approaches.
arXiv Detail & Related papers (2020-10-20T18:15:45Z)
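A minimal sketch of the inventory idea behind SSUSI: pick the enrolled speaker profiles that best match the mixture, then use them to condition separation. The cosine-similarity selection here is an assumption for illustration, not necessarily the paper's exact procedure.

```python
# Select likely speaker profiles from an inventory of enrolled embeddings.
import torch
import torch.nn.functional as F

def select_profiles(mix_embedding, inventory, k=2):
    """inventory: (n_speakers, dim) of enrolled speaker embeddings."""
    sims = F.cosine_similarity(inventory, mix_embedding.unsqueeze(0), dim=1)
    topk = sims.topk(k).indices
    return inventory[topk]              # profiles of the likely speakers

inventory = torch.randn(10, 128)        # 10 enrolled speakers
mix_embedding = torch.randn(128)        # embedding of the observed mixture
profiles = select_profiles(mix_embedding, inventory)
# 'profiles' would then condition the separation network for each speaker.
```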
- End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification [45.38809571153867]
We propose the End-to-End Neural Diarization (EEND) in which a neural network directly outputs speaker diarization results.
By feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations.
arXiv Detail & Related papers (2020-02-24T14:53:32Z)
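A minimal sketch of diarization as frame-wise multi-label classification in the spirit of EEND: the network emits per-frame speaker activities and is trained with a permutation-free binary cross-entropy (the minimum over speaker-label orderings). The network and layer sizes are illustrative assumptions.

```python
# Frame-wise multi-label speaker-activity prediction with a
# permutation-free binary cross-entropy loss.
from itertools import permutations
import torch
import torch.nn as nn

class TinyEEND(nn.Module):
    def __init__(self, n_feats=40, hidden=128, n_spk=2):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_spk)    # one logit per speaker

    def forward(self, x):                       # x: (batch, frames, n_feats)
        h, _ = self.rnn(x)
        return self.head(h)                     # (batch, frames, n_spk)

def permutation_free_bce(logits, labels):
    n_spk = logits.shape[-1]
    bce = nn.functional.binary_cross_entropy_with_logits
    losses = [bce(logits, labels[..., list(p)])
              for p in permutations(range(n_spk))]
    return torch.stack(losses).min()

x = torch.rand(2, 100, 40)
labels = (torch.rand(2, 100, 2) > 0.5).float()  # per-frame speaker activity
loss = permutation_free_bce(TinyEEND()(x), labels)
loss.backward()
```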
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
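A minimal sketch of speaker-conditioned extraction in the spirit of SpeakerBeam, where an embedding derived from the adaptation utterance multiplicatively modulates the hidden activations of the extraction network (one common adaptation-layer choice). The architecture and sizes are assumptions, not the paper's exact model.

```python
# Speaker-conditioned extraction via multiplicative adaptation.
import torch
import torch.nn as nn

class ConditionedExtractor(nn.Module):
    def __init__(self, n_bins=257, hidden=256, spk_dim=128):
        super().__init__()
        self.enc = nn.LSTM(n_bins, hidden, batch_first=True)
        self.scale = nn.Linear(spk_dim, hidden)   # embedding -> channel scales
        self.dec = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, mixture, spk_emb):          # (B, T, n_bins), (B, spk_dim)
        h, _ = self.enc(mixture)
        h = h * self.scale(spk_emb).unsqueeze(1)  # multiplicative adaptation
        return self.dec(h) * mixture              # masked target estimate

model = ConditionedExtractor()
est = model(torch.rand(2, 100, 257), torch.randn(2, 128))
```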