Into-TTS : Intonation Template based Prosody Control System
- URL: http://arxiv.org/abs/2204.01271v1
- Date: Mon, 4 Apr 2022 06:37:19 GMT
- Title: Into-TTS : Intonation Template based Prosody Control System
- Authors: Jihwan Lee, Joun Yeop Lee, Heejin Choi, Seongkyu Mun, Sangjun Park,
Chanwoo Kim
- Abstract summary: Intonation plays an important role in conveying the speaker's intention.
Current end-to-end TTS systems often fail to model proper intonation.
We propose a novel, intuitive method to synthesize speech in different intonations.
- Score: 17.68906373821669
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Intonation plays an important role in conveying the intention of
the speaker. However, current end-to-end TTS systems often fail to model
intonation properly. To alleviate this problem, we propose a novel, intuitive
method for synthesizing speech in different intonations using predefined
intonation templates. Prior to acoustic model training, the speech data are
automatically grouped into intonation templates by k-means clustering,
according to their sentence-final F0 contours. Two proposed modules are added
to the end-to-end TTS framework: an intonation classifier and an intonation
encoder. The intonation classifier recommends a suitable intonation template
for the given text. The intonation encoder, attached to the text encoder
output, synthesizes speech that follows the requested intonation template.
The main contributions of our paper are: (a) an easy-to-use intonation
control system covering a wide range of users; (b) better performance in
rendering speech in a requested intonation, with improved pitch distance and
MOS; and (c) the feasibility of future integration between TTS and NLP, with
TTS able to utilize contextual information. Audio samples are available at
https://srtts.github.io/IntoTTS.
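The template-building step and the intonation encoder both lend themselves to short illustrations. Below is a minimal sketch, not the authors' released code, of grouping utterances into intonation templates by k-means clustering of their sentence-final F0 contours; the helper sentence_final_f0, the template count K, and the window length FINAL_FRAMES are illustrative assumptions.

    # Minimal sketch (assumed details, not the paper's released code) of the
    # template-building step: utterances are grouped into intonation templates
    # by k-means clustering of their sentence-final F0 contours.
    import numpy as np
    from sklearn.cluster import KMeans

    K = 6              # assumed number of intonation templates
    FINAL_FRAMES = 40  # assumed length of the sentence-final F0 window

    def sentence_final_f0(f0):
        """Resample the last voiced F0 values into a fixed-length, normalized tail."""
        voiced = f0[f0 > 0]            # drop unvoiced (zero-F0) frames
        tail = voiced[-FINAL_FRAMES:]
        tail = np.interp(np.linspace(0, len(tail) - 1, FINAL_FRAMES),
                         np.arange(len(tail)), tail)
        return (tail - tail.mean()) / (tail.std() + 1e-8)

    def build_templates(f0_contours):
        """f0_contours: list of per-utterance F0 arrays from any pitch tracker."""
        X = np.stack([sentence_final_f0(f0) for f0 in f0_contours])
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
        return km   # km.labels_ assigns each utterance a template id

On the modeling side, the intonation encoder attached to the text encoder output can be as simple as a learned per-template embedding added to the encoder states; this is a sketch under assumed dimensions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class IntonationEncoder(nn.Module):
        """Adds a learned intonation-template embedding to the text encoder output."""
        def __init__(self, n_templates=K, dim=256):
            super().__init__()
            self.embed = nn.Embedding(n_templates, dim)

        def forward(self, text_enc, template_id):
            # text_enc: (batch, time, dim); template_id: (batch,) template ids
            return text_enc + self.embed(template_id).unsqueeze(1)

At inference time, the template id can come either from the user or from the intonation classifier's recommendation for the input text.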
Related papers
- Word-wise intonation model for cross-language TTS systems [0.0]
The proposed model is suitable for automatic data markup and extends to text-to-speech systems.
The key idea is to partially eliminate the variability caused by different placements of the stressed syllable within a word.
The model could serve as a tool for intonation research or as a backbone for prosody description in text-to-speech systems.
arXiv Detail & Related papers (2024-09-30T15:09:42Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses a cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, derived from a multilingual speech recognition model by inserting vector quantization into the encoder (a minimal sketch of this mechanism follows the list below).
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech [26.533600745910437]
We propose an effective pruning method for the Transformer, known as sparse attention, to improve the TTS model's generalization ability.
We also propose a differentiable pruning method that allows the model to learn the pruning thresholds automatically (see the threshold-pruning sketch after this list).
arXiv Detail & Related papers (2023-08-28T21:25:05Z)
- PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling [25.966328901566815]
We propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling.
Experimental results show that PauseSpeech outperforms previous models in terms of naturalness.
arXiv Detail & Related papers (2023-06-13T01:36:55Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system trained with large-scale, in-the-wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
Given a few seconds of reference speech, it can generate highly natural speech with very high similarity to the target speaker's voice.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of speaker duration model, timbre feature (identity), and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility, as measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model that uses an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baselines in pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis [79.1885389845874]
Transformer-based end-to-end text-to-speech (TTS) synthesis is one such successful implementation.
We propose a novel neural TTS model, denoted GraphSpeech, formulated under a graph neural network framework.
Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.
arXiv Detail & Related papers (2020-10-23T14:14:06Z)
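As referenced in the CosyVoice and sparse-attention entries above, two minimal sketches follow. First, vector quantization inserted into a recognition encoder to derive discrete semantic tokens; the codebook size, dimensions, and straight-through gradient trick are standard assumptions, not details from the CosyVoice paper.

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        """Maps encoder states to their nearest codebook entries (semantic tokens)."""
        def __init__(self, codebook_size=512, dim=256):
            super().__init__()
            self.codebook = nn.Embedding(codebook_size, dim)

        def forward(self, h):
            # h: (batch, time, dim) hidden states from the recognition encoder
            dists = (h.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
            ids = dists.argmin(dim=-1)       # discrete semantic token ids
            q = self.codebook(ids)
            q = h + (q - h).detach()         # straight-through gradient to h
            return q, ids

Second, a sketch of differentiable threshold pruning for attention weights, in the spirit of the learned thresholds mentioned in the sparse-attention entry; the sigmoid gate and its temperature are assumptions made to keep the operation differentiable.

    class ThresholdPruning(nn.Module):
        """Softly zeroes attention weights that fall below a learned threshold."""
        def __init__(self, init_thresh=0.01, temperature=50.0):
            super().__init__()
            self.thresh = nn.Parameter(torch.tensor(init_thresh))
            self.temperature = temperature

        def forward(self, attn):
            # attn: post-softmax attention weights, shape (..., num_keys)
            gate = torch.sigmoid(self.temperature * (attn - self.thresh))
            pruned = attn * gate
            return pruned / (pruned.sum(-1, keepdim=True) + 1e-8)  # renormalize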