Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen
Speaker and Recording Conditions
- URL: http://arxiv.org/abs/2008.05289v1
- Date: Sun, 9 Aug 2020 13:54:46 GMT
- Title: Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen
Speaker and Recording Conditions
- Authors: Dipjyoti Paul, Yannis Pantazis, Yannis Stylianou
- Abstract summary: Conventional neural vocoders are adjusted to the training speaker and have poor generalization capabilities to unseen speakers.
We propose a variant of WaveRNN, referred to as speaker conditional WaveRNN (SC-WaveRNN).
In contrast to standard WaveRNN, SC-WaveRNN exploits additional information given in the form of speaker embeddings.
- Score: 19.691323658303435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in deep learning have led to human-level
performance in single-speaker speech synthesis. However, speech quality still
degrades when these systems are generalized to multi-speaker models, especially
for unseen speakers and unseen recording conditions. For instance, conventional
neural vocoders are tuned to the training speaker and generalize poorly to
unseen speakers. In this work, we propose a variant of WaveRNN, referred to as
speaker conditional WaveRNN (SC-WaveRNN), aiming at an efficient universal
vocoder that holds up even for unseen speakers and recording conditions. In
contrast to standard WaveRNN, SC-WaveRNN exploits additional information given
in the form of speaker embeddings. Trained on publicly available data,
SC-WaveRNN performs significantly better than the baseline WaveRNN on both
subjective and objective metrics. In MOS, SC-WaveRNN achieves an improvement of
about 23% for seen speakers and seen recording conditions, and up to 95% for
unseen speakers and unseen conditions. Finally, we extend our work to
multi-speaker text-to-speech (TTS) synthesis in a manner similar to zero-shot
speaker adaptation. In preference tests, our system was chosen over the
baseline TTS system by 60% versus 15.5% for seen speakers, and by 60.9% versus
32.6% for unseen speakers.
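The key architectural change is that the vocoder's conditioning network receives an utterance-level speaker embedding alongside the acoustic features. The sketch below illustrates this idea in PyTorch; the class name, tensor dimensions, single-GRU autoregressive core, and mu-law output size are illustrative assumptions and do not reproduce the authors' implementation.

```python
# Minimal sketch of speaker-embedding conditioning for a WaveRNN-style vocoder.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class SpeakerConditionalWaveRNN(nn.Module):
    def __init__(self, n_mels=80, spk_dim=256, rnn_dim=512, n_classes=256):
        super().__init__()
        # Project mel frames plus the utterance-level speaker embedding
        # into a joint conditioning vector.
        self.cond = nn.Linear(n_mels + spk_dim, rnn_dim)
        # Autoregressive core: previous sample + conditioning -> next-sample logits.
        self.rnn = nn.GRU(1 + rnn_dim, rnn_dim, batch_first=True)
        self.out = nn.Linear(rnn_dim, n_classes)  # e.g. 8-bit mu-law classes

    def forward(self, prev_samples, mels, spk_emb):
        # prev_samples: (B, T, 1) previous waveform samples (teacher forcing)
        # mels:         (B, T, n_mels) mel frames upsampled to the sample rate
        # spk_emb:      (B, spk_dim) fixed embedding from a speaker encoder
        spk = spk_emb.unsqueeze(1).expand(-1, mels.size(1), -1)
        cond = torch.tanh(self.cond(torch.cat([mels, spk], dim=-1)))
        hidden, _ = self.rnn(torch.cat([prev_samples, cond], dim=-1))
        return self.out(hidden)  # (B, T, n_classes) logits over quantized samples


# Toy usage with random tensors, only to show the expected shapes.
model = SpeakerConditionalWaveRNN()
logits = model(torch.zeros(2, 100, 1), torch.randn(2, 100, 80), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 100, 256])
```

Concatenating the embedding with every conditioning frame is only one simple injection point; biasing the recurrent state or the upsampling network would be equally plausible variants.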
Related papers
- Analyzing And Improving Neural Speaker Embeddings for ASR [54.30093015525726]
We present our efforts toward integrating neural speaker embeddings into a Conformer-based hybrid HMM ASR system.
Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
arXiv Detail & Related papers (2023-01-11T16:56:03Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Speech-enhanced and Noise-aware Networks for Robust Speech Recognition [25.279902171523233]
A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition.
The two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task.
Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively.
arXiv Detail & Related papers (2022-03-25T15:04:51Z)
- ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis [13.676243543864347]
We propose an end-to-end method that is able to generate high-quality speech with better speaker similarity for both seen and unseen speakers.
The method consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN architecture, a FastSpeech2-based synthesizer, and a HiFi-GAN vocoder.
arXiv Detail & Related papers (2022-03-20T07:04:26Z)
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech using the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling.
arXiv Detail & Related papers (2021-10-30T19:24:57Z)
- SVSNet: An End-to-end Speaker Voice Similarity Assessment Model [61.3813595968834]
We propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between natural speech and synthesized speech.
The experimental results on the Voice Conversion Challenge 2018 and 2020 show that SVSNet notably outperforms well-known baseline systems.
arXiv Detail & Related papers (2021-07-20T10:19:46Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels in the former case enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art non-streaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
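Several of the entries above, like the zero-shot TTS extension described in the abstract, rely on a speaker encoder that maps a few seconds of reference speech to a fixed embedding, which then conditions synthesis without any fine-tuning. The following is a minimal sketch of such an encoder, again in PyTorch; the d-vector-style LSTM, the averaging over time, and all dimensions are illustrative assumptions rather than any specific paper's recipe.

```python
# Minimal sketch of a reference-speech speaker encoder for zero-shot adaptation.
# Architecture and dimensions are assumptions, not a specific paper's recipe.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, spk_dim=256):
        super().__init__()
        # Frame-level recurrent encoder over reference-speech mel frames.
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, spk_dim)

    def forward(self, ref_mels):
        # ref_mels: (B, T, n_mels) mel frames of the reference utterance
        frames, _ = self.lstm(ref_mels)
        emb = self.proj(frames.mean(dim=1))          # average over time
        return emb / emb.norm(dim=-1, keepdim=True)  # L2-normalized embedding


# The resulting embedding would condition both the acoustic model and the
# SC-WaveRNN vocoder at synthesis time, so unseen speakers need no fine-tuning.
encoder = SpeakerEncoder()
spk_emb = encoder(torch.randn(1, 300, 80))  # a few seconds of reference mels
print(spk_emb.shape)  # torch.Size([1, 256])
```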