Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals
- URL: http://arxiv.org/abs/2011.09624v2
- Date: Fri, 2 Apr 2021 08:38:58 GMT
- Title: Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals
- Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li
- Abstract summary: We propose a speaker extraction technique that performs in multiple stages to take full advantage of a short reference speech sample.
For the first time, we use a frame-level sequential speech embedding as the reference for the target speaker.
- Score: 113.78060608441348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker extraction requires a sample speech from the target speaker as the reference. However, enrolling a speaker with a long speech sample is not practical. We propose a speaker extraction technique that performs in multiple stages to take full advantage of a short reference speech sample. The speech extracted in the early stages is used as the reference speech for the later stages. For the first time, we use a frame-level sequential speech embedding as the reference for the target speaker, a departure from the traditional utterance-based speaker embedding reference. In addition, a signal fusion scheme is proposed to combine the decoded signals at multiple scales with automatically learned weights. Experiments on WSJ0-2mix and its noisy and reverberant versions (WHAM! and WHAMR!) show that the proposed SpEx++ consistently outperforms other state-of-the-art baselines.
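The two mechanisms in the abstract (re-using the stage-i estimate as the reference for stage i+1, and fusing multi-scale decoder outputs with learned weights) can be made concrete in a few lines. The following is a minimal PyTorch sketch under assumed toy dimensions; every module name and layer choice here is illustrative, not the authors' SpEx++ implementation.

```python
# Minimal sketch of multi-stage extraction with a frame-level reference
# and learned multi-scale signal fusion. All sizes are toy assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStageExtractor(nn.Module):
    def __init__(self, num_stages=2, num_scales=3, dim=128):
        super().__init__()
        self.num_stages = num_stages
        # Speech encoders for the mixture and the reference signal.
        self.mix_encoder = nn.Conv1d(1, dim, kernel_size=20, stride=10)
        self.ref_encoder = nn.Conv1d(1, dim, kernel_size=20, stride=10)
        # Extractor conditioned on frame-level (not pooled) reference frames.
        self.separator = nn.GRU(2 * dim, dim, batch_first=True)
        # One decoder per scale, with growing kernel sizes.
        self.decoders = nn.ModuleList(
            nn.ConvTranspose1d(dim, 1, kernel_size=20 * 2 ** s, stride=10)
            for s in range(num_scales))
        # Fusion weights over the multi-scale outputs, learned automatically.
        self.fusion_logits = nn.Parameter(torch.zeros(num_scales))

    def one_stage(self, mixture, reference):
        mix = self.mix_encoder(mixture.unsqueeze(1)).transpose(1, 2)  # (B,T,D)
        # Frame-level sequential reference embedding: keep the time axis
        # instead of mean-pooling it into one utterance vector, then
        # (crudely) resample it to the mixture's frame rate.
        ref = self.ref_encoder(reference.unsqueeze(1))                # (B,D,T')
        ref = F.interpolate(ref, size=mix.size(1)).transpose(1, 2)    # (B,T,D)
        h, _ = self.separator(torch.cat([mix, ref], dim=-1))          # (B,T,D)
        outs = [dec(h.transpose(1, 2)) for dec in self.decoders]      # (B,1,L_s)
        # Signal fusion: trim to a common length, then mix the scales
        # with softmax-normalised learned weights.
        n = min(o.size(-1) for o in outs)
        w = torch.softmax(self.fusion_logits, dim=0)
        return sum(w[s] * outs[s][..., :n] for s in range(len(outs))).squeeze(1)

    def forward(self, mixture, short_reference):
        ref = short_reference
        for _ in range(self.num_stages):
            est = self.one_stage(mixture, ref)
            ref = est  # the stage-i estimate becomes the stage-(i+1) reference
        return est

model = MultiStageExtractor()
mix = torch.randn(2, 16000)   # 1 s two-speaker mixtures at 16 kHz
ref = torch.randn(2, 4000)    # short 0.25 s enrolment samples
print(model(mix, ref).shape)  # torch.Size([2, 16000])
```

Keeping the reference embedding as a frame-level sequence, rather than pooling it into a single utterance vector, is what distinguishes the frame-level conditioning described in the abstract from the traditional utterance-based reference.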
Related papers
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
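As a rough illustration of the two-stage split described in the LA-VocE entry above, here is a hypothetical stage-1 skeleton: a transformer maps concatenated noisy-mel and lip features to clean mel frames, and a separately trained neural vocoder (stage 2) would turn those frames into a waveform. All layer sizes and the feature concatenation are assumptions, not LA-VocE's architecture.

```python
# Hypothetical stage-1 skeleton for a two-stage audio-visual enhancer.
import torch
import torch.nn as nn

class MelPredictor(nn.Module):
    """Stage 1: predict clean mel frames from noisy audio-visual features."""
    def __init__(self, n_mels=80, video_dim=512, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels + video_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_mels)

    def forward(self, noisy_mel, lip_feats):
        # noisy_mel: (B, T, n_mels); lip_feats: (B, T, video_dim),
        # assumed to be already aligned to the same frame rate.
        x = self.proj(torch.cat([noisy_mel, lip_feats], dim=-1))
        return self.head(self.encoder(x))  # (B, T, n_mels)

# Stage 2 would hand the predicted mel-spectrogram to a separately trained
# neural vocoder (e.g. HiFi-GAN) to synthesise the enhanced waveform.
mel = MelPredictor()(torch.randn(1, 50, 80), torch.randn(1, 50, 512))
print(mel.shape)  # torch.Size([1, 50, 80])
```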
- Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech [25.707717591185386]
We show that it is possible to clone the voice of a speaker as well as the prosody of a spoken reference independently, without any degradation in quality.
All of our code and trained models are available, alongside static and interactive demos.
arXiv Detail & Related papers (2022-06-24T11:54:59Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
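For readers unfamiliar with t-SOT, the serialization that the entry above relies on can be illustrated in plain Python: tokens from overlapping speakers are merged in time order, with a special channel-change token marking each switch. This toy function is an assumption-level illustration, not the paper's code.

```python
# Toy illustration of token-level serialized output training (t-SOT)
# serialization for two overlapping speakers.
def t_sot_serialize(utterances):
    """utterances: list of (speaker_id, [(time, token), ...])."""
    timed = [(t, spk, tok) for spk, toks in utterances for t, tok in toks]
    timed.sort(key=lambda x: x[0])                 # chronological order
    out, prev_spk = [], None
    for _, spk, tok in timed:
        if prev_spk is not None and spk != prev_spk:
            out.append("<cc>")                     # channel-change token
        out.append(tok)
        prev_spk = spk
    return out

print(t_sot_serialize([
    ("A", [(0.0, "hello"), (0.4, "world")]),
    ("B", [(0.2, "hi"), (0.6, "there")]),
]))
# ['hello', '<cc>', 'hi', '<cc>', 'world', '<cc>', 'there']
```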
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model improves speech naturalness and content quality by using multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- Guided Training: A Simple Method for Single-channel Speaker Separation [40.34570426165019]
We propose a strategy to train a long short-term memory (LSTM) model to solve the permutation problem in speaker separation.
Thanks to its powerful sequence-modeling capability, the LSTM can use its memory cells to track and separate the target speech from interfering speech.
arXiv Detail & Related papers (2021-03-26T08:46:50Z)
- Speaker Separation Using Speaker Inventories and Estimated Speech [78.57067876891253]
We propose speaker separation using speaker inventories (SSUSI) and speaker separation using estimated speech (SSUES).
By combining the advantages of permutation invariant training (PIT) and speech extraction, SSUSI significantly outperforms conventional approaches.
arXiv Detail & Related papers (2020-10-20T18:15:45Z)
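Since the entry above builds on permutation invariant training (PIT), a minimal sketch of a PIT loss may help: the training loss is evaluated under every assignment of network outputs to reference speakers, and the best assignment is kept. The implementation below is a generic PyTorch illustration, not the paper's code.

```python
# Generic utterance-level PIT loss: take the minimum MSE over all
# permutations of outputs against reference speakers.
from itertools import permutations
import torch
import torch.nn.functional as F

def pit_mse_loss(estimates, targets):
    """estimates, targets: (B, n_spk, T) waveforms (or masks)."""
    n_spk = targets.size(1)
    losses = []
    for perm in permutations(range(n_spk)):
        # Per-utterance MSE under one particular output-to-speaker pairing.
        losses.append(torch.stack([
            F.mse_loss(estimates[:, i], targets[:, p],
                       reduction='none').mean(dim=-1)
            for i, p in enumerate(perm)]).mean(dim=0))
    # Keep the best permutation for each utterance, then average the batch.
    return torch.stack(losses, dim=0).min(dim=0).values.mean()

est = torch.randn(4, 2, 16000)  # two estimated sources per mixture
tgt = torch.randn(4, 2, 16000)  # two reference sources per mixture
print(pit_mse_loss(est, tgt))
```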
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his or her voice characteristics.
However, SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that the proposed strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
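The adaptation idea in the entry above can be sketched as a multiplicative adaptation layer: an embedding pooled from the target speaker's adaptation utterance gates the hidden units of the extraction network, biasing it toward that speaker. The layer below is a hedged, self-contained PyTorch illustration; SpeakerBeam's actual time-domain architecture differs.

```python
# Sketch of speaker-conditioned extraction via multiplicative adaptation.
import torch
import torch.nn as nn

class SpeakerBeamLikeLayer(nn.Module):
    """Scales hidden units of the extraction network by a gate computed
    from the target speaker's adaptation utterance."""
    def __init__(self, dim=256, emb_dim=128):
        super().__init__()
        self.aux = nn.GRU(dim, emb_dim, batch_first=True)  # auxiliary network
        self.gate = nn.Linear(emb_dim, dim)

    def forward(self, mix_feats, enroll_feats):
        # mix_feats: (B, T, dim) mixture features;
        # enroll_feats: (B, T_e, dim) adaptation-utterance features.
        h, _ = self.aux(enroll_feats)
        spk = h.mean(dim=1)                  # utterance-level speaker vector
        g = torch.sigmoid(self.gate(spk))    # per-unit adaptation weights
        return mix_feats * g.unsqueeze(1)    # bias extraction toward target

layer = SpeakerBeamLikeLayer()
out = layer(torch.randn(2, 100, 256), torch.randn(2, 30, 256))
print(out.shape)  # torch.Size([2, 100, 256])
```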