Unsupervised word-level prosody tagging for controllable speech
synthesis
- URL: http://arxiv.org/abs/2202.07200v2
- Date: Wed, 16 Feb 2022 05:43:03 GMT
- Title: Unsupervised word-level prosody tagging for controllable speech
synthesis
- Authors: Yiwei Guo, Chenpeng Du, Kai Yu
- Abstract summary: We propose a novel approach for unsupervised word-level prosody tagging with two stages.
We first group the words into different types with a decision tree according to their phonetic content and then cluster the prosodies using GMM.
A TTS system with the derived word-level prosody tags is trained for controllable speech synthesis.
- Score: 19.508501785186755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although word-level prosody modeling in neural text-to-speech (TTS) has been
investigated in recent research for diverse speech synthesis, it is still
challenging to control speech synthesis manually without a specific reference.
This is largely due to the lack of word-level prosody tags. In this work, we
propose a novel approach for unsupervised word-level prosody tagging with two
stages, where we first group the words into different types with a decision
tree according to their phonetic content and then cluster the prosodies using
GMM within each word type separately. This design is based on the
assumption that the prosodies of different types of words, such as long or short
words, should be tagged with different label sets. Furthermore, a TTS system
with the derived word-level prosody tags is trained for controllable speech
synthesis. Experiments on LJSpeech show that the TTS model trained with
word-level prosody tags not only achieves better naturalness than a typical
FastSpeech2 model, but also gains the ability to manipulate word-level prosody.
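As a rough illustration of the two-stage tagging pipeline described in this abstract, the sketch below groups words by phone count (a stand-in for the paper's phonetic-content decision tree) and then fits a separate GMM per word type to derive discrete prosody tags. The specifics here, the scikit-learn GaussianMixture, the three prosody features (mean F0, mean energy, log duration), the phone-count thresholds, and the four tags per type, are assumptions made for illustration rather than details taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def group_words_by_phonetic_content(phone_counts, boundaries=(2, 5)):
    """Stage 1 (approximation): assign each word a type from its phone count.

    The paper uses a decision tree over phonetic content; fixed thresholds
    (short / medium / long words) play that role in this sketch.
    """
    return np.digitize(phone_counts, bins=boundaries)


def cluster_prosody_per_type(features, word_types, n_tags=4, seed=0):
    """Stage 2: fit one GMM per word type; each word gets a discrete tag."""
    tags = np.full(len(features), -1, dtype=int)
    models = {}
    for t in np.unique(word_types):
        idx = np.where(word_types == t)[0]
        gmm = GaussianMixture(n_components=n_tags, random_state=seed)
        tags[idx] = gmm.fit_predict(features[idx])
        models[t] = gmm
    return tags, models


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy word-level prosody features: [mean F0, mean energy, log duration].
    feats = rng.normal(size=(1000, 3))
    phone_counts = rng.integers(1, 9, size=1000)  # toy phones-per-word counts

    types = group_words_by_phonetic_content(phone_counts)
    prosody_tags, gmms = cluster_prosody_per_type(feats, types)
    print(prosody_tags[:10])  # discrete tags usable as word-level conditioning labels
```

Per the abstract, the derived tags are then used as word-level conditioning labels when training the TTS model, which is what enables manual prosody manipulation at synthesis time.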
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z) - From Characters to Words: Hierarchical Pre-trained Language Model for
Open-vocabulary Language Understanding [22.390804161191635]
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens.
This process known as tokenization relies on a pre-built vocabulary of words or sub-word morphemes.
We introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach.
arXiv Detail & Related papers (2023-05-23T23:22:20Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
A recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - RWEN-TTS: Relation-aware Word Encoding Network for Natural
Text-to-Speech Synthesis [3.591224588041813]
A huge number of text-to-speech (TTS) models produce human-like speech.
The Relation-aware Word Encoding Network (RWEN) effectively incorporates syntactic and semantic information through two modules.
Experimental results show substantial improvements compared to previous works.
arXiv Detail & Related papers (2022-12-15T16:17:03Z) - token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired
Speech and Text [65.04385919645395]
token2vec is a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to 17.7% relative WER reduction.
arXiv Detail & Related papers (2022-10-30T06:38:19Z) - FCTalker: Fine and Coarse Grained Context Modeling for Expressive
Conversational Speech Synthesis [75.74906149219817]
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation.
arXiv Detail & Related papers (2022-10-27T12:20:20Z) - SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation [10.016862617549991]
This paper proposes SoundChoice, a novel Grapheme-to-Phoneme (G2P) architecture that processes entire sentences rather than operating at the word level.
SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.
arXiv Detail & Related papers (2022-07-27T01:14:59Z) - Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for
Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z) - Word-Level Style Control for Expressive, Non-attentive Speech Synthesis [1.8262960053058506]
The proposed model attempts to learn word-level stylistic and prosodic representations of the speech data, with the aid of two encoders.
We find that the resulting model gives both word-level and global control over the style, as well as prosody transfer capabilities.
arXiv Detail & Related papers (2021-11-19T12:03:53Z) - Unsupervised Abstractive Dialogue Summarization for Tete-a-Tetes [49.901984490961624]
We propose the first unsupervised abstractive dialogue summarization model for tete-a-tetes (SuTaT).
SuTaT consists of a conditional generative module and two unsupervised summarization modules.
Experimental results show that SuTaT is superior on unsupervised dialogue summarization for both automatic and human evaluations.
arXiv Detail & Related papers (2020-09-15T03:27:52Z)