Towards Spontaneous Style Modeling with Semi-supervised Pre-training for
Conversational Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2308.16593v1
- Date: Thu, 31 Aug 2023 09:50:33 GMT
- Title: Towards Spontaneous Style Modeling with Semi-supervised Pre-training for
Conversational Text-to-Speech Synthesis
- Authors: Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou, Zhiyong Wu, Shiyin
Kang, Helen Meng
- Abstract summary: We propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels.
In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behaviors labels in speech.
- Score: 53.511443791260206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The spontaneous behavior that often occurs in conversations makes speech more
human-like compared to reading-style. However, synthesizing spontaneous-style
speech is challenging due to the lack of high-quality spontaneous datasets and
the high cost of labeling spontaneous behavior. In this paper, we propose a
semi-supervised pre-training method to increase the amount of spontaneous-style
speech and spontaneous behavioral labels. In the process of semi-supervised
learning, both text and speech information are considered for detecting
spontaneous behaviors labels in speech. Moreover, a linguistic-aware encoder is
used to model the relationship between each sentence in the conversation.
Experimental results indicate that our proposed method achieves superior
expressive speech synthesis performance with the ability to model spontaneous
behavior in spontaneous-style speech and predict reasonable spontaneous
behavior from text.
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z)
- Controlling Emotion in Text-to-Speech with Natural Language Prompts [29.013577423045255]
We propose a system conditioned on embeddings derived from an emotionally rich text that serves as prompt.
A joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture.
Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model.
arXiv Detail & Related papers (2024-06-10T15:58:42Z)
- Expressivity and Speech Synthesis [51.75420054449122]
We outline the methodological advances that brought us so far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity.
We also discuss the societal implications coupled with rapidly advancing expressive speech synthesis (ESS) technology.
arXiv Detail & Related papers (2024-04-30T08:47:24Z)
- Non-verbal information in spontaneous speech -- towards a new framework of analysis [0.5559722082623594]
This paper offers an analytical schema and a technological proof-of-concept for the categorization of prosodic signals.
We present a classification process that disentangles prosodic phenomena of three orders.
Disentangling prosodic patterns can direct a theory of communication and speech organization.
arXiv Detail & Related papers (2024-03-06T08:03:05Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
We propose the Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT).
The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis [35.32703818003108]
We focus on personalized spontaneous speech synthesis that can clone both the individual's voice timbre and speech disfluency.
We develop a speech synthesis method with a non-personalized external filled pause predictor trained with a multi-speaker corpus.
arXiv Detail & Related papers (2022-10-14T06:29:33Z)
- AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style [111.89762723159677]
We develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech.
AdaSpeech 3 synthesizes speech with natural FP and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
arXiv Detail & Related papers (2021-07-06T10:40:45Z)
- On the Role of Style in Parsing Speech with Neural Models [25.442727974788255]
We show that neural approaches facilitate using written text to improve parsing of spontaneous speech.
We find an asymmetric degradation from read vs. spontaneous speech.
arXiv Detail & Related papers (2020-10-08T22:44:19Z)