Controllable Accented Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2209.10804v1
- Date: Thu, 22 Sep 2022 06:13:07 GMT
- Title: Controllable Accented Text-to-Speech Synthesis
- Authors: Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li
- Abstract summary: We propose a neural TTS architecture that allows us to control the accent and its intensity during inference.
This is the first study of accented TTS synthesis with explicit intensity control.
- Score: 76.80549143755242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accented text-to-speech (TTS) synthesis seeks to generate speech with an
accent (L2) as a variant of the standard version (L1). Accented TTS synthesis
is challenging as L2 is different from L1 in both in terms of phonetic
rendering and prosody pattern. Furthermore, there is no easy solution to the
control of the accent intensity in an utterance. In this work, we propose a
neural TTS architecture, that allows us to control the accent and its intensity
during inference. This is achieved through three novel mechanisms, 1) an accent
variance adaptor to model the complex accent variance with three prosody
controlling factors, namely pitch, energy and duration; 2) an accent intensity
modeling strategy to quantify the accent intensity; 3) a consistency constraint
module to encourage the TTS system to render the expected accent intensity at a
fine level. Experiments show that the proposed system attains superior
performance to the baseline models in terms of accent rendering and intensity
control. To our best knowledge, this is the first study of accented TTS
synthesis with explicit intensity control.
Related papers
- AccentBox: Towards High-Fidelity Zero-Shot Accent Generation [20.40688498862892]
We propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS.
In the first stage, we achieve state-of-the-art (SOTA) on Accent Identification (AID) with 0.56 f1 score on unseen speakers.
In the second stage, we condition ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model.
arXiv Detail & Related papers (2024-09-13T06:05:10Z) - Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT [29.167336994990542]
Cross-dialect text-to-speech (CD-TTS) is a task to synthesize learned speakers' voices in non-native dialects.
We present a novel TTS model comprising three sub-modules to perform competitively at this task.
arXiv Detail & Related papers (2024-09-11T13:40:27Z) - Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training [14.323313455208183]
Inclusive speech technology aims to erase any biases towards specific groups, such as people of certain accent.
We propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion.
arXiv Detail & Related papers (2024-06-03T05:56:02Z) - NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
arXiv Detail & Related papers (2024-03-05T16:35:25Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - Explicit Intensity Control for Accented Text-to-speech [65.35831577398174]
How to control the intensity of accent in the process of TTS is a very interesting research direction.
Recent work design a speaker-versaadrial loss to disentangle the speaker and accent information, and then adjust the loss weight to control the accent intensity.
This paper propose a new intuitive and explicit accent intensity control scheme for accented TTS.
arXiv Detail & Related papers (2022-10-27T12:23:41Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedup up to 21.4x than autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Emphasis control for parallel neural TTS [8.039245267912511]
The semantic information conveyed by a speech signal is strongly influenced by local variations in prosody.
Recent parallel neural text-to-speech (TTS) methods are able to generate speech with high fidelity while maintaining high performance.
This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control by learning a latent space that directly corresponds to a change in emphasis.
arXiv Detail & Related papers (2021-10-06T18:45:39Z) - AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style [111.89762723159677]
We develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech.
AdaSpeech 3 synthesizes speech with natural FP and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
arXiv Detail & Related papers (2021-07-06T10:40:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.