Towards Multi-Scale Style Control for Expressive Speech Synthesis
- URL: http://arxiv.org/abs/2104.03521v1
- Date: Thu, 8 Apr 2021 05:50:09 GMT
- Title: Towards Multi-Scale Style Control for Expressive Speech Synthesis
- Authors: Xiang Li, Changhe Song, Jingbei Li, Zhiyong Wu, Jia Jia, Helen Meng
- Abstract summary: The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
- Score: 60.08928435252417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a multi-scale speech style modeling method for
end-to-end expressive speech synthesis. The proposed method employs a
multi-scale reference encoder to extract both the global-scale utterance-level
and the local-scale quasi-phoneme-level style features of the target speech,
which are then fed into the speech synthesis model as an extension to the input
phoneme sequence. During training, the multi-scale style model can be
jointly trained with the speech synthesis model in an end-to-end fashion. When
the proposed method is applied to the style transfer task, experimental results
indicate that the controllability of the multi-scale speech style model and the
expressiveness of the synthesized speech are greatly improved. Moreover,
assigning different reference speeches for style extraction at each scale
further demonstrates the flexibility of the proposed method.
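The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the idea, assuming a mel-spectrogram reference input: a shared frame-level encoder feeds a pooled global (utterance-level) style vector and downsampled local (quasi-phoneme-level) style vectors, which are then concatenated onto the phoneme encoder outputs. All module names, dimensions, and the fixed 8-frame pooling used to approximate the quasi-phoneme rate are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a multi-scale reference encoder: one global
# utterance-level style vector plus a sequence of local quasi-phoneme-level
# style vectors, both attached to the phoneme encoder outputs.
import torch
import torch.nn as nn


class MultiScaleReferenceEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, style_dim=128):
        super().__init__()
        # Shared frame-level feature extractor over the reference mel-spectrogram.
        self.frame_rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        # Global scale: pool all frames into one utterance-level style vector.
        self.global_proj = nn.Linear(2 * hidden, style_dim)
        # Local scale: downsample frames to roughly quasi-phoneme rate (assumed
        # fixed 8-frame pooling here), then project each segment.
        self.local_pool = nn.AvgPool1d(kernel_size=8, stride=8)
        self.local_proj = nn.Linear(2 * hidden, style_dim)

    def forward(self, ref_mel):
        # ref_mel: (batch, frames, n_mels)
        frames, _ = self.frame_rnn(ref_mel)                  # (B, T, 2*hidden)
        global_style = self.global_proj(frames.mean(dim=1))  # (B, style_dim)
        local = self.local_pool(frames.transpose(1, 2)).transpose(1, 2)
        local_style = self.local_proj(local)                 # (B, T//8, style_dim)
        return global_style, local_style


def extend_phoneme_sequence(phoneme_hidden, global_style, local_style):
    """Attach both style scales to the phoneme encoder outputs.

    Local style vectors are aligned to phonemes here by nearest-neighbor
    interpolation; the paper's actual alignment mechanism may differ.
    """
    B, N, _ = phoneme_hidden.shape
    local_on_phonemes = nn.functional.interpolate(
        local_style.transpose(1, 2), size=N, mode="nearest"
    ).transpose(1, 2)                                        # (B, N, style_dim)
    global_on_phonemes = global_style.unsqueeze(1).expand(B, N, -1)
    return torch.cat([phoneme_hidden, global_on_phonemes, local_on_phonemes], dim=-1)
```

Because the two scales are extracted independently, different reference utterances can be passed to the global and local branches at inference time, which is the per-scale flexibility the abstract refers to.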
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z)
- StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis [63.019962126807116]
The expressive quality of synthesized speech for audiobooks is limited by a generalized model architecture and an unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z)
- Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling [0.36868085124383626]
We describe an end-to-end speech synthesis system that uses generative adversarial training.
We train our vocoder for raw phoneme-to-audio conversion, using explicit phonetic, pitch, and duration modeling.
arXiv Detail & Related papers (2023-10-14T18:15:51Z)
- A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis [19.271542595753267]
This paper explores predicting suitable prosodic features for fine-grained emotion analysis from discourse-level text.
We propose a Discourse-level Multi-scale text Prosodic Model (D-MPM) that exploits multi-scale text to predict these prosodic features.
arXiv Detail & Related papers (2023-09-21T07:45:44Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of an SSL model, trained on a large amount of data, to obtain embedding vectors from speech representations.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different speech samples.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis [37.65745551401636]
Cross-speaker style transfer in speech synthesis aims at transferring a style from a source speaker to synthesized speech in a target speaker's timbre.
In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style.
A strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre.
arXiv Detail & Related papers (2023-03-14T08:52:58Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis [37.93814851450597]
We propose a hierarchical framework to model speaking style from context.
A hierarchical context encoder is proposed to explore a wider range of contextual information.
To encourage this encoder to learn style representation better, we introduce a novel training strategy.
arXiv Detail & Related papers (2022-03-23T05:27:57Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech can be re-synthesized by feeding the discrete symbols into the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model that synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single reference audio.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.