AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style
- URL: http://arxiv.org/abs/2107.02530v1
- Date: Tue, 6 Jul 2021 10:40:45 GMT
- Title: AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style
- Authors: Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan
Shen, Wei-Qiang Zhang, Tie-Yan Liu
- Abstract summary: We develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech.
AdaSpeech 3 synthesizes speech with natural filled pauses (FPs) and rhythms in spontaneous style, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
- Score: 111.89762723159677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent text to speech (TTS) models perform very well in synthesizing
reading-style (e.g., audiobook) speech, it is still challenging to synthesize
spontaneous-style speech (e.g., podcast or conversation), mainly for two
reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty
in modeling the filled pauses (um and uh) and diverse rhythms in spontaneous
speech. In this paper, we develop AdaSpeech 3, an adaptive TTS system that
fine-tunes a well-trained reading-style TTS model for spontaneous-style speech.
Specifically, 1) to insert filled pauses (FP) in the text sequence
appropriately, we introduce an FP predictor to the TTS model; 2) to model the
varying rhythms, we introduce a duration predictor based on mixture of experts
(MoE), which contains three experts responsible for the generation of fast,
medium and slow speech respectively, and fine-tune it as well as the pitch
predictor for rhythm adaptation; 3) to adapt to the timbre of other speakers,
we fine-tune some parameters in the decoder with a small amount of speech data.
To address the lack of training data, we mine a spontaneous speech dataset to
support this work and facilitate future research on spontaneous
TTS. Experiments show that AdaSpeech 3 synthesizes speech with natural FP and
rhythms in spontaneous styles, and achieves much better MOS and SMOS scores
than previous adaptive TTS systems.
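The abstract names three concrete components. As a rough illustration, the sketch below shows how the first two, the filled-pause (FP) predictor and the mixture-of-experts (MoE) duration predictor with fast/medium/slow experts, could be structured in PyTorch; module names, layer sizes, and the FP label set are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the two adaptation modules described in the abstract:
# a token-level filled-pause (FP) predictor and an MoE duration predictor
# with three experts (fast / medium / slow). Sizes and labels are assumed.
import torch
import torch.nn as nn


class FilledPausePredictor(nn.Module):
    """Predicts, per text token, whether to insert a filled pause (um/uh)."""

    def __init__(self, hidden_dim: int = 256, num_fp_classes: int = 3):
        super().__init__()
        # Assumed labeling: 0 = no FP, 1 = insert "um", 2 = insert "uh".
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_fp_classes),
        )

    def forward(self, text_hidden: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq_len, hidden_dim) from the TTS text encoder.
        return self.classifier(text_hidden)  # (batch, seq_len, num_fp_classes) logits


class MoEDurationPredictor(nn.Module):
    """Duration predictor whose three experts are mixed by a softmax router."""

    def __init__(self, hidden_dim: int = 256, num_experts: int = 3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),  # log-duration per token
            )
            for _ in range(num_experts)
        )
        self.router = nn.Linear(hidden_dim, num_experts)

    def forward(self, text_hidden: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(text_hidden), dim=-1)              # (B, T, E)
        durations = torch.cat([e(text_hidden) for e in self.experts], dim=-1)  # (B, T, E)
        return (weights * durations).sum(dim=-1)                               # (B, T)


# Usage sketch: both modules read the text-encoder output and would be
# fine-tuned, along with the pitch predictor and a few decoder parameters,
# on a small amount of spontaneous speech.
encoder_out = torch.randn(2, 50, 256)
fp_logits = FilledPausePredictor()(encoder_out)
log_durations = MoEDurationPredictor()(encoder_out)
```

In this reading, the router softly weights the three experts per token, so fine-tuning on spontaneous speech can shift probability mass toward the faster or slower experts, while the FP head would be trained on text annotated with filled-pause positions.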
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the large language models (LLMs) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z) - NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
arXiv Detail & Related papers (2024-03-05T16:35:25Z) - PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and
Pause-based Prosody Modeling [25.966328901566815]
We propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling.
Experimental results show PauseSpeech outperforms previous models in terms of naturalness.
arXiv Detail & Related papers (2023-06-13T01:36:55Z) - ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech
Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z) - ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource
Scenarios [5.06044403956839]
We develop ComedicSpeech, a TTS system tailored for stand-up comedy synthesis in low-resource scenarios.
We extract a prosody representation with a prosody encoder and condition the TTS model on it in a flexible way.
Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten minutes of training data per comedian.
arXiv Detail & Related papers (2023-05-20T14:24:45Z) - Duration-aware pause insertion using pre-trained language model for
multi-speaker text-to-speech [40.65850332919397]
We propose more powerful pause insertion frameworks based on a pre-trained language model.
Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus.
We also leverage duration-aware pause insertion for more natural multi-speaker TTS.
arXiv Detail & Related papers (2023-02-27T10:40:41Z) - Prosody-controllable spontaneous TTS with neural HMMs [11.472325158964646]
We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets.
We add utterance-level prosody control to an existing neural HMM-based TTS system.
We evaluate the system's capability of synthesizing two types of creaky voice.
arXiv Detail & Related papers (2022-11-24T11:06:11Z) - StyleTTS: A Style-Based Generative Model for Natural and Diverse
Text-to-Speech Synthesis [23.17929822987861]
StyleTTS is a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance.
Our method significantly outperforms state-of-the-art models on both single and multi-speaker datasets.
arXiv Detail & Related papers (2022-05-30T21:34:40Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedups of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With style-adaptive layer normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single speech sample (a brief SALN sketch follows this list).
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
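For context on the last entry, here is a minimal sketch of style-adaptive layer normalization (SALN) as the Meta-StyleSpeech summary describes it: the layer-norm gain and bias are predicted from a style vector extracted from reference speech. Dimensions and names are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of style-adaptive layer normalization (SALN):
# gain and bias of a layer norm are predicted from a style vector.
import torch
import torch.nn as nn


class SALN(nn.Module):
    def __init__(self, hidden_dim: int = 256, style_dim: int = 128):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_gain = nn.Linear(style_dim, hidden_dim)
        self.to_bias = nn.Linear(style_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim); style: (batch, style_dim)
        gain = self.to_gain(style).unsqueeze(1)  # (batch, 1, hidden_dim)
        bias = self.to_bias(style).unsqueeze(1)
        return gain * self.norm(hidden) + bias
```

In this sketch, a speaker-specific style vector modulates every normalized hidden state, which is how a single reference utterance can steer the synthesized style.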
This list is automatically generated from the titles and abstracts of the papers on this site.