StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2205.15439v2
- Date: Mon, 20 Nov 2023 04:31:13 GMT
- Title: StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis
- Authors: Yinghao Aaron Li, Cong Han, Nima Mesgarani
- Abstract summary: StyleTTS is a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance.
Our method significantly outperforms state-of-the-art models on both single- and multi-speaker datasets.
- Score: 23.17929822987861
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text-to-Speech (TTS) has recently seen great progress in synthesizing
high-quality speech owing to the rapid development of parallel TTS systems, but
producing speech with naturalistic prosodic variations, speaking styles and
emotional tones remains challenging. Moreover, since duration and speech are
generated separately, parallel TTS models still struggle to find the optimal
monotonic alignments that are crucial for naturalistic speech synthesis. Here,
we propose StyleTTS, a style-based generative model for parallel TTS that can
synthesize diverse speech with natural prosody from a reference speech
utterance. With a novel Transferable Monotonic Aligner (TMA) and
duration-invariant data augmentation schemes, our method significantly
outperforms state-of-the-art models on both single- and multi-speaker datasets
in subjective tests of speech naturalness and speaker similarity. Through
self-supervised learning of speaking styles, our model can synthesize speech
with the same prosodic and emotional tone as any given reference speech,
without the need for explicit labels for these categories.
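As a rough illustration of the reference-based style conditioning described above, here is a minimal Python/PyTorch sketch: a style encoder pools a reference mel-spectrogram into a fixed-size style vector, which then modulates decoder features through adaptive instance normalization (AdaIN). The module names, dimensions, and layer choices are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of reference-based style conditioning:
# a style encoder pools a reference mel-spectrogram into one style vector,
# which modulates decoder features via adaptive instance normalization (AdaIN).
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    """Maps a reference mel-spectrogram (B, n_mels, T) to a style vector (B, style_dim)."""

    def __init__(self, n_mels: int = 80, style_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, style_dim, kernel_size=5, padding=2),
        )

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        h = self.net(ref_mel)      # (B, style_dim, T)
        return h.mean(dim=-1)      # temporal average pooling -> (B, style_dim)


class AdaIN1d(nn.Module):
    """Adaptive instance norm: the style vector predicts per-channel scale and shift."""

    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(style_dim, 2 * channels)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.affine(style).chunk(2, dim=-1)  # (B, C) each
        return (1 + gamma.unsqueeze(-1)) * self.norm(x) + beta.unsqueeze(-1)


# Usage: condition text-aligned decoder features on a reference utterance's style.
style_enc = StyleEncoder()
adain = AdaIN1d(channels=256, style_dim=128)
ref_mel = torch.randn(1, 80, 200)    # reference speech (mel-spectrogram)
dec_feat = torch.randn(1, 256, 400)  # decoder features for the input text
out = adain(dec_feat, style_enc(ref_mel))
print(out.shape)  # torch.Size([1, 256, 400])
```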
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with very high similarity to the target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue [19.149834552175076]
This study aims to realize a text-to-speech (TTS) that closely resembles human dialogue.
First, we record and transcribe actual spontaneous dialogues.
The proposed dialogue TTS is trained in two stages: in the first stage, a variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS model is trained.
arXiv Detail & Related papers (2022-06-24T02:32:12Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS [7.531331499935223]
We train a non-autoregressive parallel neural TTS model hierarchically conditioned on coarse and fine-grained acoustic speech features.
Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension; a sketch of extracting such utterance-level features follows this entry.
arXiv Detail & Related papers (2021-10-06T17:58:42Z)
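The utterance-wise prosody features named in the entry above can be approximated with standard signal-processing tools. The following hypothetical Python sketch uses librosa; the exact definitions (percentile-based pitch range, log-spectrum slope as spectral tilt) are common conventions rather than the paper's own, and per-phoneme duration is omitted because it additionally requires text alignments.

```python
# Hypothetical utterance-level prosody features (pitch, pitch range, energy,
# spectral tilt); definitions are common choices, not necessarily the paper's.
import numpy as np
import librosa


def utterance_prosody_features(wav: np.ndarray, sr: int) -> dict:
    # Frame-level F0 with voicing decisions (NaN on unvoiced frames).
    f0, _, _ = librosa.pyin(
        wav, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    # Frame-level energy (root-mean-square amplitude).
    rms = librosa.feature.rms(y=wav)[0]
    # Spectral tilt: slope of a line fit to the average log-magnitude spectrum.
    mean_spec = np.abs(librosa.stft(wav)).mean(axis=1)
    freqs = librosa.fft_frequencies(sr=sr)
    tilt = np.polyfit(freqs, 20 * np.log10(mean_spec + 1e-9), deg=1)[0]
    # Per-phoneme duration is omitted: it would also require text alignments.
    return {
        "pitch_mean_hz": float(np.nanmean(f0)),
        # Pitch range as the spread between the 5th and 95th voiced-F0 percentiles.
        "pitch_range_hz": float(np.nanpercentile(f0, 95) - np.nanpercentile(f0, 5)),
        "energy_mean": float(rms.mean()),
        "spectral_tilt_db_per_hz": float(tilt),
    }


# Usage on a synthetic 220 Hz tone (stands in for a real utterance).
sr = 22050
wav = 0.5 * np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr).astype(np.float32)
print(utterance_prosody_features(wav, sr))
```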
- AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style [111.89762723159677]
We develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech.
AdaSpeech 3 synthesizes speech with natural filled pauses (FP) and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
arXiv Detail & Related papers (2021-07-06T10:40:45Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both global-scale utterance-level and local-scale quasi-phoneme-level style features of the target speech; a sketch of such an encoder follows this entry.
During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
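As a sketch of the multi-scale idea in the entry above, this hypothetical PyTorch module returns both a pooled utterance-level style vector and a downsampled sequence of local style features; the layer choices, and the 4x average pooling standing in for quasi-phoneme granularity, are illustrative assumptions.

```python
# Hypothetical multi-scale reference encoder: one pooled global (utterance-level)
# style vector plus a downsampled sequence of local (quasi-phoneme-level) styles.
import torch
import torch.nn as nn


class MultiScaleReferenceEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, style_dim: int = 128):
        super().__init__()
        self.frame_net = nn.Conv1d(n_mels, style_dim, kernel_size=5, padding=2)
        # 4x average pooling stands in for quasi-phoneme granularity (assumption).
        self.local_pool = nn.AvgPool1d(kernel_size=4, stride=4)

    def forward(self, ref_mel: torch.Tensor):
        h = torch.relu(self.frame_net(ref_mel))  # (B, style_dim, T) frame features
        local_styles = self.local_pool(h)        # (B, style_dim, T // 4)
        global_style = h.mean(dim=-1)            # (B, style_dim) utterance style
        return global_style, local_styles


# Usage: both scales can then condition the synthesis model jointly.
enc = MultiScaleReferenceEncoder()
g, loc = enc(torch.randn(1, 80, 400))
print(g.shape, loc.shape)  # torch.Size([1, 128]) torch.Size([1, 128, 100])
```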
This list is automatically generated from the titles and abstracts of the papers on this site.