Explicit Intensity Control for Accented Text-to-speech
- URL: http://arxiv.org/abs/2210.15364v1
- Date: Thu, 27 Oct 2022 12:23:41 GMT
- Title: Explicit Intensity Control for Accented Text-to-speech
- Authors: Rui Liu, Haolin Zuo, De Hu, Guanglai Gao, Haizhou Li
- Abstract summary: How to control accent intensity during TTS is an interesting and increasingly active research direction.
Recent work designs a speaker-adversarial loss to disentangle the speaker and accent information, then adjusts the loss weight to control the accent intensity.
This paper proposes a new, intuitive, and explicit accent intensity control scheme for accented TTS.
- Score: 65.35831577398174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accented text-to-speech (TTS) synthesis seeks to generate speech with an
accent (L2) as a variant of the standard version (L1). How to control the
intensity of the accent during TTS is an interesting research direction that
has attracted increasing attention. Recent work designs a speaker-adversarial
loss to disentangle the speaker and accent information, and then adjusts the
loss weight to control the accent intensity. However, such a control method
lacks interpretability, and there is no direct correlation between the
controlling factor and natural accent intensity. To this end, this paper
proposes a new intuitive and explicit accent intensity control scheme for
accented TTS. Specifically, we first extract the posterior probability, called
``goodness of pronunciation (GoP)'', from the L1 speech recognition model to
quantify the phoneme-level accent intensity of accented speech, then design a
FastSpeech2-based TTS model, named Ai-TTS, to take the accent intensity
expression into account during speech generation. Experiments show that our
method outperforms the baseline model in terms of accent rendering and
intensity control.
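The abstract does not spell out the exact GoP formulation, but a common definition scores each phoneme by the average log posterior that an L1 acoustic model assigns to its canonical label over the aligned frames. The sketch below assumes a frame-level phoneme posteriorgram and a forced alignment are already available; the function name and array layout are illustrative, not the paper's implementation.

```python
import numpy as np

def goodness_of_pronunciation(posteriors, phoneme_frames, phoneme_ids):
    """Score each phoneme by its average log posterior under an L1 ASR model.

    posteriors:     (T, P) array of frame-level phoneme posterior probabilities
    phoneme_frames: list of (start, end) frame spans, one per phoneme (end exclusive)
    phoneme_ids:    canonical phoneme index for each span

    Returns one GoP value per phoneme. Higher (closer to 0) means the frames
    look more like the canonical L1 phoneme, i.e. a weaker accent.
    """
    scores = []
    for (start, end), pid in zip(phoneme_frames, phoneme_ids):
        # Posterior of the canonical phoneme on its own aligned frames.
        frame_probs = posteriors[start:end, pid]
        # Average log posterior; small epsilon guards against log(0).
        scores.append(float(np.mean(np.log(frame_probs + 1e-10))))
    return scores
```

Under this view, lower GoP indicates stronger accent, giving the TTS model a per-phoneme intensity signal that is directly interpretable, unlike an adversarial loss weight.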
Related papers
- Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training [14.323313455208183]
Inclusive speech technology aims to erase any biases towards specific groups, such as people with a certain accent.
We propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion.
arXiv Detail & Related papers (2024-06-03T05:56:02Z)
- Controllable Emphasis with zero data for text-to-speech [57.12383531339368]
A simple but effective method to achieve emphasized speech is to increase the predicted duration of the emphasized word.
We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and testers' correct identification of the emphasized word in a sentence by 40% on a reference female en-US voice.
arXiv Detail & Related papers (2023-07-13T21:06:23Z)
- DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech [30.110058338155675]
Cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres.
We propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style.
By combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis.
arXiv Detail & Related papers (2023-06-25T06:46:36Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Modelling low-resource accents without accent-specific TTS frontend [4.185844990558149]
This work focuses on modelling a speaker's accent for which no dedicated text-to-speech (TTS) frontend exists.
We propose an approach whereby we first augment the target accent data to sound like the donor voice via voice conversion.
We then train a multi-speaker multi-accent TTS model on the combination of recordings and synthetic data, to generate the target accent.
arXiv Detail & Related papers (2023-01-11T18:00:29Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder [14.323313455208183]
This paper introduces a novel framework for accented Text-to-Speech (TTS) synthesis based on a Variational Autoencoder.
It has the ability to synthesize a selected speaker's voice, which is converted to any desired target accent.
arXiv Detail & Related papers (2022-11-07T05:36:30Z)
- Controllable Accented Text-to-Speech Synthesis [76.80549143755242]
We propose a neural TTS architecture that allows us to control the accent and its intensity during inference.
This is the first study of accented TTS synthesis with explicit intensity control.
arXiv Detail & Related papers (2022-09-22T06:13:07Z)
- AdaSpeech: Adaptive Text to Speech for Custom Voice [104.69219752194863]
We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker.
arXiv Detail & Related papers (2021-03-01T13:28:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.