Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit
- URL: http://arxiv.org/abs/2008.05656v1
- Date: Thu, 13 Aug 2020 02:54:50 GMT
- Title: Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit
- Authors: Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao
- Abstract summary: A prosody learning mechanism is proposed to model the prosody of speech based on a TTS system.
A novel self-attention structure, named local attention, is proposed to lift the restriction on input text length.
Experiments on English and Mandarin show that our model obtains speech with more satisfactory prosody.
- Score: 39.258370942013165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent neural speech synthesis systems have gradually focused on the
control of prosody to improve the quality of synthesized speech, but they rarely
consider the variability of prosody and the correlation between prosody and
semantics together. In this paper, a prosody learning mechanism is proposed to
model the prosody of speech based on a TTS system, where the prosody information
of speech is extracted from the mel-spectrum by a prosody learner and combined
with the phoneme sequence to reconstruct the mel-spectrum. Meanwhile, the
semantic features of the text from a pre-trained language model are introduced
to improve the prosody prediction results. In addition, a novel self-attention
structure, named local attention, is proposed to lift the restriction on input
text length: the relative position information of the sequence is modeled by
relative position matrices, so that position encodings are no longer needed.
Experiments on English and Mandarin show that our model obtains speech with more
satisfactory prosody. In Mandarin synthesis in particular, our proposed model
outperforms the baseline model with a MOS gap of 0.08, and the overall
naturalness of the synthesized speech is significantly improved.
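The abstract's local-attention idea, replacing absolute position encodings with relative position information so input length is unbounded, can be illustrated with a minimal single-head sketch. This is an assumption-laden simplification, not the paper's exact formulation: here the relative position matrices are reduced to a learned scalar bias per clipped relative distance, and the names (`local_relative_attention`, `rel_bias`, `window`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_relative_attention(x, Wq, Wk, Wv, rel_bias, window):
    """Single-head self-attention with a relative-position bias.

    rel_bias has shape (2*window + 1,): one learned scalar per clipped
    relative distance j - i. Because scores depend only on relative
    offsets, no absolute position encoding is needed and the sequence
    length T is unrestricted.
    """
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(d)          # (T, T) content scores
    idx = np.arange(T)
    rel = np.clip(idx[None, :] - idx[:, None], -window, window) + window
    scores = scores + rel_bias[rel]          # add relative-position bias
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d, T, w = 8, 5, 2
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
bias = rng.standard_normal(2 * w + 1) * 0.1
out = local_relative_attention(rng.standard_normal((T, d)), *W, bias, w)
print(out.shape)
```

Because distances beyond `window` are clipped to the same bias, the same parameters apply to sequences of any length, which is the property the paper uses to remove the text-length limit.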
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z)
- Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation [6.225927189801006]
We propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns.
Remarkably, our framework possesses the capability to consistently generate natural speech even for considerably more extended and intricate out-of-domain (OOD) sentences.
arXiv Detail & Related papers (2024-04-03T09:17:38Z)
- A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis [19.271542595753267]
This paper explores predicting suitable prosodic features for fine-grained emotion analysis from the discourse-level text.
We propose a Discourse-level Multi-scale text Prosodic Model (D-MPM) that exploits multi-scale text to predict these two prosodic features.
arXiv Detail & Related papers (2023-09-21T07:45:44Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After predicting the discrete symbol sequence, each target speech can be re-synthesized by feeding the symbols into the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters [2.6572330982240935]
This paper presents new techniques in a continuous vocoder, in which all features are continuous, yielding a flexible speech synthesis system.
A new continuous noise masking based on phase distortion is proposed to eliminate the perceptual impact of residual noise.
Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters, producing more natural, human-like speech.
arXiv Detail & Related papers (2021-06-19T12:05:01Z)
- Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z)
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture in which the prediction of phoneme-level prosody features is conditioned on word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.