Word-Level Style Control for Expressive, Non-attentive Speech Synthesis
- URL: http://arxiv.org/abs/2111.10173v1
- Date: Fri, 19 Nov 2021 12:03:53 GMT
- Title: Word-Level Style Control for Expressive, Non-attentive Speech Synthesis
- Authors: Konstantinos Klapsas, Nikolaos Ellinas, June Sig Sung, Hyoungmin Park,
Spyros Raptis
- Abstract summary: The proposed architecture attempts to learn word-level stylistic and prosodic representations of the speech data with the aid of two encoders.
We find that the resulting model gives both word-level and global control over the style, as well as prosody transfer capabilities.
- Score: 1.8262960053058506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents an expressive speech synthesis architecture for
modeling and controlling the speaking style at the word level. It attempts to
learn word-level stylistic and prosodic representations of the speech data
with the aid of two encoders. The first models style by finding a combination
of style tokens for each word given the acoustic features; the second outputs
a word-level sequence conditioned only on the phonetic information, in order
to disentangle it from the style information. The two encoder outputs are
aligned and concatenated with the phoneme encoder outputs, then decoded with a
Non-Attentive Tacotron model. An extra prior encoder predicts the style tokens
autoregressively, so that the model can run without a reference utterance. We
find that the resulting model gives both word-level and global control over
the style, as well as prosody transfer capabilities.
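Since the abstract specifies the overall wiring (two word-level encoders, alignment to the phoneme sequence, concatenation, and a Non-Attentive Tacotron decoder), a minimal sketch of how such a pipeline could be assembled is given below. All module choices, sizes, the mean-pooling word encoder, and the repeat-based word-to-phoneme alignment are illustrative assumptions, not the paper's actual layers; the autoregressive prior encoder used at inference is omitted.

```python
# Minimal sketch of the described architecture (PyTorch, unbatched tensors
# for clarity; requires PyTorch >= 1.11 for unbatched RNN/attention inputs).
import torch
import torch.nn as nn

class WordStyleTTS(nn.Module):
    def __init__(self, n_phonemes=100, n_style_tokens=10, d=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d)
        self.phoneme_encoder = nn.LSTM(d, d // 2, bidirectional=True)
        # Learnable style-token bank; each word attends over it.
        self.style_tokens = nn.Parameter(torch.randn(n_style_tokens, d))
        self.style_attn = nn.MultiheadAttention(d, num_heads=4)
        self.phonetic_word_encoder = nn.Linear(d, d)
        # Stand-in for the Non-Attentive Tacotron decoder.
        self.decoder = nn.LSTM(3 * d, d)

    def forward(self, phonemes, word_acoustic, phones_per_word):
        # phonemes: (T_p,) ids; word_acoustic: (T_w, d) per-word acoustic
        # features; phones_per_word: list[int] summing to T_p.
        phone_out, _ = self.phoneme_encoder(self.embed(phonemes))
        # Style encoder: a combination of style tokens per word, given acoustics.
        q = word_acoustic.unsqueeze(1)          # (T_w, 1, d)
        kv = self.style_tokens.unsqueeze(1)     # (n_tokens, 1, d)
        word_style, _ = self.style_attn(q, kv, kv)
        word_style = word_style.squeeze(1)      # (T_w, d)
        # Second encoder: word-level output from phonetic info only
        # (mean-pooled per word here), to disentangle it from style.
        words = torch.split(phone_out, phones_per_word)
        word_phonetic = self.phonetic_word_encoder(
            torch.stack([w.mean(0) for w in words]))
        # Align word-level streams to phoneme rate by repetition, then concat.
        rep = torch.tensor(phones_per_word)
        dec_in = torch.cat([phone_out,
                            word_style.repeat_interleave(rep, dim=0),
                            word_phonetic.repeat_interleave(rep, dim=0)], dim=-1)
        out, _ = self.decoder(dec_in)
        return out
```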
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
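As a rough illustration of the "supervised semantic tokens" idea, the sketch below quantizes ASR encoder states against a learned codebook; all names and sizes are assumptions, not CosyVoice's released implementation.

```python
# Sketch: quantize ASR encoder hidden states into discrete speech tokens.
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    def __init__(self, codebook_size=4096, d=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, d)

    def forward(self, h):                              # h: (T, d) encoder states
        dists = torch.cdist(h, self.codebook.weight)   # (T, codebook_size)
        ids = dists.argmin(dim=-1)                     # discrete semantic tokens
        quantized = self.codebook(ids)
        # Straight-through estimator so gradients still reach the encoder.
        quantized = h + (quantized - h).detach()
        return ids, quantized

# Downstream, per the abstract: an LLM maps text -> token ids, and a
# conditional flow-matching model maps token ids -> speech.
```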
- StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis [63.019962126807116]
The expressive quality of synthesized speech for audiobooks is limited by the generalized model architecture and the unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z)
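A generic VQ-VAE training objective of the kind such pre-training typically relies on is sketched below; the paper's exact losses and modules may differ.

```python
# Standard VQ-VAE loss: reconstruction + codebook + commitment terms.
import torch
import torch.nn.functional as F

def vq_vae_loss(x, x_recon, z_e, z_q, beta=0.25):
    # z_e: encoder output before quantization; z_q: nearest codebook vectors.
    recon = F.mse_loss(x_recon, x)               # reconstruct the input
    codebook = F.mse_loss(z_q, z_e.detach())     # pull codes toward encodings
    commit = F.mse_loss(z_e, z_q.detach())       # keep the encoder committed
    return recon + codebook + beta * commit
```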
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
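The pseudo-language induction can be pictured as clustering speech features into discrete units and collapsing repeats, as in this sketch. Fitting k-means per utterance is a simplification (in practice it is fit over a corpus), and Wav2Seq's subword (BPE) step is omitted.

```python
# Sketch: turn continuous speech features into a pseudo transcript.
import numpy as np
from sklearn.cluster import KMeans

def pseudo_transcript(features, n_units=500):
    # features: (T, d) self-supervised speech features for one utterance.
    ids = KMeans(n_clusters=n_units, n_init=10).fit_predict(features)
    # Collapse consecutive repeats to get a compact discrete sequence.
    deduped = [int(ids[0])] + [int(i) for p, i in zip(ids, ids[1:]) if i != p]
    return deduped   # target sequence for the pseudo ASR task
```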
- Unsupervised word-level prosody tagging for controllable speech synthesis [19.508501785186755]
We propose a novel approach for unsupervised word-level prosody tagging with two stages.
We first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies using a GMM.
A TTS system with the derived word-level prosody tags is trained for controllable speech synthesis.
arXiv Detail & Related papers (2022-02-15T05:28:23Z)
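The two-stage procedure maps naturally onto standard components, as in this sketch; the feature choices and the regression-tree reading of the "decision tree" stage are assumptions for illustration.

```python
# Sketch: decision-tree word types, then per-type GMM prosody tags.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.mixture import GaussianMixture

def tag_words(phonetic_feats, prosody_feats, n_types=8, n_tags=5):
    # Stage 1: a tree over phonetic features; its leaves define word types.
    tree = DecisionTreeRegressor(max_leaf_nodes=n_types)
    tree.fit(phonetic_feats, prosody_feats)
    types = tree.apply(phonetic_feats)          # leaf id per word
    # Stage 2: cluster prosody within each type with a GMM -> prosody tag.
    # (Assumes each type contains at least n_tags words.)
    tags = np.zeros(len(prosody_feats), dtype=int)
    for t in np.unique(types):
        idx = types == t
        gmm = GaussianMixture(n_components=n_tags).fit(prosody_feats[idx])
        tags[idx] = gmm.predict(prosody_feats[idx])
    return types, tags
```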
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
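One simple way to realize a style token plus retrieved keywords is to prepend them to the decoder input, as in this illustrative snippet; the token formats and the retrieval step are assumptions, not the paper's exact interface.

```python
# Sketch: condition a caption decoder on a style token and retrieved keywords.
def build_decoder_prompt(style, keywords, caption_tokens):
    # style: e.g. "<web>" vs "<curated>"; keywords: from a retrieval component.
    return [style] + [f"<kw:{k}>" for k in keywords] + caption_tokens

prompt = build_decoder_prompt("<curated>", ["dog", "frisbee"],
                              ["a", "dog", "catches", "a", "frisbee"])
```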
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training time, the multi-scale style model could be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
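A multi-scale reference encoder can be sketched as one pooled utterance-level vector plus a downsampled local (quasi-phoneme-level) sequence; the layers and pooling rate below are assumptions.

```python
# Sketch: global + local style features from a mel-spectrogram reference.
import torch
import torch.nn as nn

class MultiScaleRefEncoder(nn.Module):
    def __init__(self, d_in=80, d=128):
        super().__init__()
        self.frame_net = nn.GRU(d_in, d)
        self.local_pool = nn.AvgPool1d(kernel_size=8, stride=8)  # ~phoneme rate

    def forward(self, mel):                    # mel: (T, d_in), unbatched
        h, _ = self.frame_net(mel)             # (T, d) frame-level features
        global_style = h.mean(dim=0)           # utterance-level vector
        local_style = self.local_pool(h.t().unsqueeze(0)).squeeze(0).t()
        return global_style, local_style       # shapes (d,), (T // 8, d)
```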
- Inference Time Style Control for Summarization [6.017006996402699]
We present two novel methods that can be deployed during summary decoding on any pre-trained Transformer-based summarization model.
In experiments on summarization with simplicity control, both automatic evaluation and human judges find that our models produce outputs in simpler language while remaining informative.
arXiv Detail & Related papers (2021-04-05T00:27:18Z)
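The abstract does not detail the two methods, so the snippet below shows only a generic decoding-time biasing idea for simplicity control, not the paper's actual techniques.

```python
# Generic illustration: bias next-token logits toward a "simple vocabulary"
# before sampling, on top of a frozen summarization model.
import torch

def bias_logits(logits, simple_token_ids, bonus=2.0):
    # logits: (vocab,) next-token scores from the pre-trained model.
    biased = logits.clone()
    biased[simple_token_ids] += bonus      # prefer simpler words at decode time
    return biased
```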
- Exploring Contextual Word-level Style Relevance for Unsupervised Style Transfer [60.07283363509065]
Unsupervised style transfer aims to change the style of an input sentence while preserving its original content.
We propose a novel attentional sequence-to-sequence model that exploits the relevance of each output word to the target style.
Experimental results show that our proposed model achieves state-of-the-art performance in terms of both transfer accuracy and content preservation.
arXiv Detail & Related papers (2020-05-05T10:24:28Z)
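Word-level style relevance can be pictured as a soft weighting of decoder states against a target-style embedding, as in this illustration (not the paper's model).

```python
# Sketch: score each output word's relevance to a target style.
import torch
import torch.nn.functional as F

def word_style_relevance(word_vecs, style_vec):
    # word_vecs: (T, d) decoder states; style_vec: (d,) target-style embedding.
    scores = word_vecs @ style_vec            # dot-product affinity per word
    return F.softmax(scores, dim=0)           # relevance weights over words
```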
This list is automatically generated from the titles and abstracts of the papers on this site.