Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech
- URL: http://arxiv.org/abs/2106.12896v2
- Date: Fri, 25 Jun 2021 18:25:04 GMT
- Title: Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech
- Authors: Raahil Shah, Kamil Pokora, Abdelhamid Ezzerg, Viacheslav Klimkov,
Goeric Huybrechts, Bartosz Putrycz, Daniel Korzekwa, Thomas Merritt
- Abstract summary: We present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker.
Compared to the current state-of-the-art approach, our proposed improvements close the gap to recordings by 23.3% for naturalness of speech.
- Score: 5.521191428642322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whilst recent neural text-to-speech (TTS) approaches produce high-quality
speech, they typically require a large amount of recordings from the target
speaker. In previous work, a 3-step method was proposed to generate
high-quality TTS while greatly reducing the amount of data required for
training. However, we have observed a ceiling effect in the level of
naturalness achievable for highly expressive voices when using this approach.
In this paper, we present a method for building highly expressive TTS voices
with as little as 15 minutes of speech data from the target speaker. Compared
to the current state-of-the-art approach, our proposed improvements close the
gap to recordings by 23.3% for naturalness of speech and by 16.3% for speaker
similarity. Further, we match the naturalness and speaker similarity of a
Tacotron2-based full-data (~10 hours) model using only 15 minutes of target
speaker data, whereas with 30 minutes or more, we significantly outperform it.
The following improvements are proposed: 1) changing from an autoregressive,
attention-based TTS model to a non-autoregressive model that replaces attention
with an external duration model, and 2) an additional Conditional Generative
Adversarial Network (cGAN)-based fine-tuning step.
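To make the first proposed improvement concrete, the sketch below shows (in PyTorch) the general idea of replacing attention with an explicit duration model: a separate predictor outputs a duration in frames for every phoneme, the phoneme encodings are upsampled accordingly (length regulation), and the decoder then produces all mel-spectrogram frames in parallel. This is a minimal illustration of the technique, not the authors' architecture; every module name, layer size, and shape is an assumption.

```python
# Minimal sketch, not the authors' implementation: a non-autoregressive acoustic
# model in which attention-based alignment is replaced by an explicit duration
# model. A separate predictor outputs a duration (in frames) per phoneme, phoneme
# encodings are upsampled accordingly ("length regulation"), and the decoder then
# produces all mel-spectrogram frames in parallel.
# All module names, layer sizes and shapes are illustrative assumptions.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence


class DurationModel(nn.Module):
    """Predicts a duration in frames for every input phoneme."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, phoneme_enc: torch.Tensor) -> torch.Tensor:
        # phoneme_enc: (batch, phonemes, hidden) -> durations: (batch, phonemes)
        return torch.relu(self.net(phoneme_enc)).squeeze(-1)


def length_regulate(phoneme_enc: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme encoding by its (rounded) duration to reach frame length."""
    reps = durations.round().long().clamp(min=1)
    upsampled = [torch.repeat_interleave(enc, rep, dim=0) for enc, rep in zip(phoneme_enc, reps)]
    return pad_sequence(upsampled, batch_first=True)  # (batch, frames, hidden)


class NonAutoregressiveTTS(nn.Module):
    def __init__(self, num_phonemes: int = 100, hidden: int = 256, n_mels: int = 80):
        super().__init__()
        self.encoder = nn.Embedding(num_phonemes, hidden)
        self.duration_model = DurationModel(hidden)
        self.decoder = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_mels))

    def forward(self, phonemes: torch.Tensor) -> torch.Tensor:
        enc = self.encoder(phonemes)              # (batch, phonemes, hidden)
        durations = self.duration_model(enc)      # frames per phoneme, no attention needed
        frames = length_regulate(enc, durations)  # (batch, frames, hidden)
        return self.decoder(frames)               # (batch, frames, n_mels) mel-spectrogram


if __name__ == "__main__":
    model = NonAutoregressiveTTS()
    mel = model(torch.randint(0, 100, (2, 12)))   # two utterances of 12 phonemes each
    print(mel.shape)                              # torch.Size([2, total_frames, 80])
```

In a real system the duration predictor would be trained against durations obtained by forced alignment and the decoder would be far deeper; the second proposed improvement, cGAN-based fine-tuning, adds an adversarial objective on top of this and is not shown here.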
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series, which offers compact yet high-performing conversational TTS models.
They can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z) - Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech [26.533600745910437]
We propose an effective pruning method for the Transformer, known as sparse attention, to improve the TTS model's generalization abilities.
We also propose a new differentiable pruning method that allows the model to automatically learn the thresholds.
arXiv Detail & Related papers (2023-08-28T21:25:05Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z) - Transfer Learning Framework for Low-Resource Text-to-Speech using a
Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale speech corpus with text labels.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech data for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z) - Adapting TTS models For New Speakers using Transfer Learning [12.46931609726818]
Training neural text-to-speech (TTS) models for a new speaker typically requires several hours of high-quality speech data.
We propose transfer-learning guidelines for adapting high quality single-speaker TTS models for a new speaker, using only a few minutes of speech data.
arXiv Detail & Related papers (2021-10-12T07:51:25Z) - Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single reference speech sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z) - Low-resource expressive text-to-speech using data augmentation [12.396086122947679]
We present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data.
First, we augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers.
Next, we use that synthetic data on top of the available recordings to train a TTS model (a minimal sketch of this recipe follows the list below).
arXiv Detail & Related papers (2020-11-11T11:22:37Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.