Explicit Intensity Control for Accented Text-to-speech
- URL: http://arxiv.org/abs/2210.15364v1
- Date: Thu, 27 Oct 2022 12:23:41 GMT
- Title: Explicit Intensity Control for Accented Text-to-speech
- Authors: Rui Liu, Haolin Zuo, De Hu, Guanglai Gao, Haizhou Li
- Abstract summary: How to control accent intensity during TTS is an interesting and increasingly active research direction.
Recent work designs a speaker-adversarial loss to disentangle the speaker and accent information, then adjusts the loss weight to control the accent intensity.
This paper proposes a new, intuitive, and explicit accent intensity control scheme for accented TTS.
- Score: 65.35831577398174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accented text-to-speech (TTS) synthesis seeks to generate speech with an
accent (L2) as a variant of the standard version (L1). How to control the
intensity of the accent during TTS is an interesting research direction that
has attracted increasing attention. Recent work designs a speaker-adversarial
loss to disentangle the speaker and accent information, and then adjusts the
loss weight to control the accent intensity. However, such a control method
lacks interpretability, and there is no direct correlation between the
controlling factor and natural accent intensity. To this end, this paper
proposes a new intuitive and explicit accent intensity control scheme for
accented TTS. Specifically, we first extract the posterior probability, called
``goodness of pronunciation (GoP)'', from the L1 speech recognition model to
quantify the phoneme-level accent intensity of accented speech, then design a
FastSpeech2-based TTS model, named Ai-TTS, to take the accent intensity
expression into account during speech generation. Experiments show that our
method outperforms the baseline model in terms of accent rendering and
intensity control.
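The abstract does not spell out the exact GoP formulation, but a common definition scores each phoneme by the average log posterior that an L1 acoustic model assigns to its canonical label over the aligned frames. The sketch below assumes a frame-level phoneme posteriorgram and a forced alignment are already available; the function name and array layout are illustrative, not the paper's implementation.

```python
import numpy as np

def goodness_of_pronunciation(posteriors, phoneme_frames, phoneme_ids):
    """Score each phoneme by its average log posterior under an L1 ASR model.

    posteriors:     (T, P) array of frame-level phoneme posterior probabilities
    phoneme_frames: list of (start, end) frame spans, one per phoneme (end exclusive)
    phoneme_ids:    canonical phoneme index for each span

    Returns one GoP value per phoneme. Higher (closer to 0) means the frames
    look more like the canonical L1 phoneme, i.e. a weaker accent.
    """
    scores = []
    for (start, end), pid in zip(phoneme_frames, phoneme_ids):
        # Posterior of the canonical phoneme on its own aligned frames.
        frame_probs = posteriors[start:end, pid]
        # Average log posterior; small epsilon guards against log(0).
        scores.append(float(np.mean(np.log(frame_probs + 1e-10))))
    return scores
```

Under this view, lower GoP indicates stronger accent, giving the TTS model a per-phoneme intensity signal that is directly interpretable, unlike an adversarial loss weight.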
Related papers
- Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training [14.323313455208183]
Inclusive speech technology aims to erase any biases towards specific groups, such as people with a certain accent.
We propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion.
arXiv Detail & Related papers (2024-06-03T05:56:02Z)
- Controllable Emphasis with zero data for text-to-speech [57.12383531339368]
A simple but effective method to achieve emphasized speech is to increase the predicted duration of the emphasized word.
We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and testers' correct identification of the emphasized word in a sentence by 40% on a reference female en-US voice.
arXiv Detail & Related papers (2023-07-13T21:06:23Z)
- DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech [30.110058338155675]
Cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres.
We propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style.
By combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis.
arXiv Detail & Related papers (2023-06-25T06:46:36Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Modelling low-resource accents without accent-specific TTS frontend [4.185844990558149]
This work focuses on modelling a speaker's accent for which no dedicated text-to-speech (TTS) frontend exists.
We propose an approach whereby we first augment the target accent data to sound like the donor voice via voice conversion.
We then train a multi-speaker multi-accent TTS model on the combination of recordings and synthetic data, to generate the target accent.
arXiv Detail & Related papers (2023-01-11T18:00:29Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder [14.323313455208183]
This paper introduces a novel framework for accented Text-to-Speech (TTS) synthesis based on a Variational Autoencoder.
It has the ability to synthesize a selected speaker's voice, which is converted to any desired target accent.
arXiv Detail & Related papers (2022-11-07T05:36:30Z)
- Controllable Accented Text-to-Speech Synthesis [76.80549143755242]
We propose a neural TTS architecture that allows us to control the accent and its intensity during inference.
This is the first study of accented TTS synthesis with explicit intensity control.
arXiv Detail & Related papers (2022-09-22T06:13:07Z)
- AdaSpeech: Adaptive Text to Speech for Custom Voice [104.69219752194863]
We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker.
arXiv Detail & Related papers (2021-03-01T13:28:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.