Modelling low-resource accents without accent-specific TTS frontend
- URL: http://arxiv.org/abs/2301.04606v1
- Date: Wed, 11 Jan 2023 18:00:29 GMT
- Title: Modelling low-resource accents without accent-specific TTS frontend
- Authors: Georgi Tinchev, Marta Czarnowska, Kamil Deja, Kayoko Yanagisawa,
Marius Cotescu
- Abstract summary: This work focuses on modelling a speaker's accent that does not have a dedicated text-to-speech (TTS) frontend.
We propose an approach whereby we first augment the target accent data to sound like the donor voice via voice conversion.
We then train a multi-speaker multi-accent TTS model on the combination of recordings and synthetic data, to generate the donor's voice speaking in the target accent.
- Score: 4.185844990558149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work focuses on modelling a speaker's accent that does not have a
dedicated text-to-speech (TTS) frontend, including a grapheme-to-phoneme (G2P)
module. Prior work on modelling accents assumes a phonetic transcription is
available for the target accent, which might not be the case for low-resource,
regional accents. In our work, we propose an approach whereby we first augment
the target accent data to sound like the donor voice via voice conversion, then
train a multi-speaker multi-accent TTS model on the combination of recordings
and synthetic data, to generate the donor's voice speaking in the target
accent. Throughout the procedure, we use a TTS frontend developed for the same
language but a different accent. We show qualitative and quantitative analysis
where the proposed strategy achieves state-of-the-art results compared to other
generative models. Our work demonstrates that low resource accents can be
modelled with relatively little data and without developing an accent-specific
TTS frontend. Audio samples of our model converting to multiple accents are
available on our web page.
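As a rough illustration of the two-stage recipe in the abstract, below is a minimal Python sketch of the data flow: first voice-convert the target-accent recordings into the donor's voice, then train a multi-speaker multi-accent TTS model on the combined real and synthetic data. All names here (Utterance, voice_convert, train_multi_accent_tts) are hypothetical placeholders, not APIs from the paper or any real library.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Utterance:
        text: str      # transcript, in the shared language
        audio: bytes   # waveform placeholder
        speaker: str   # speaker identity label
        accent: str    # accent label

    def voice_convert(utt: Utterance, donor_speaker: str) -> Utterance:
        # Stage 1 (hypothetical): keep the text and accent, swap the
        # speaker identity to the donor voice via voice conversion.
        return Utterance(utt.text, utt.audio, donor_speaker, utt.accent)

    def train_multi_accent_tts(
        data: List[Utterance],
    ) -> Callable[[str, str, str], bytes]:
        # Stage 2 (hypothetical): train a multi-speaker, multi-accent TTS
        # model on real recordings plus the synthetic VC data; per the
        # paper, the frontend is one built for a different accent of the
        # same language, so no accent-specific G2P is needed.
        def synthesize(text: str, speaker: str, accent: str) -> bytes:
            return b""  # placeholder for generated audio
        return synthesize

    donor_data = [Utterance("hello", b"", "donor", "source_accent")]
    target_data = [Utterance("hello", b"", "spk_a", "target_accent")]

    # Stage 1: make the target-accent data sound like the donor voice.
    synthetic = [voice_convert(u, "donor") for u in target_data]

    # Stage 2: train on the combination, then generate the donor's voice
    # speaking in the target accent.
    tts = train_multi_accent_tts(donor_data + synthetic)
    audio = tts("hello world", "donor", "target_accent")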
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into native to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT [29.167336994990542]
Cross-dialect text-to-speech (CD-TTS) is the task of synthesizing learned speakers' voices in non-native dialects.
We present a novel TTS model comprising three sub-modules to perform competitively at this task.
arXiv Detail & Related papers (2024-09-11T13:40:27Z)
- Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech [6.243356997302935]
We introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model.
In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker.
In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, which are then used to train a single-speaker acoustic model.
arXiv Detail & Related papers (2023-09-15T09:03:14Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Multilingual Multiaccented Multispeaker TTS with RADTTS [21.234787964238645]
We present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS.
We demonstrate the ability to control the synthesized accent for any speaker in an open-source dataset comprising 7 accents.
arXiv Detail & Related papers (2023-01-24T22:39:04Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voices, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Explicit Intensity Control for Accented Text-to-speech [65.35831577398174]
How to control accent intensity during TTS is an interesting research direction.
Recent work designs a speaker-adversarial loss to disentangle speaker and accent information, then adjusts the loss weight to control accent intensity.
This paper proposes a new, intuitive, and explicit accent intensity control scheme for accented TTS.
arXiv Detail & Related papers (2022-10-27T12:23:41Z)
- Controllable Accented Text-to-Speech Synthesis [76.80549143755242]
We propose a neural TTS architecture that allows us to control the accent and its intensity during inference.
This is the first study of accented TTS synthesis with explicit intensity control.
arXiv Detail & Related papers (2022-09-22T06:13:07Z)
- Low-resource expressive text-to-speech using data augmentation [12.396086122947679]
We present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data.
First, we augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers.
Next, we use that synthetic data on top of the available recordings to train a TTS model.
arXiv Detail & Related papers (2020-11-11T11:22:37Z)
- AccentDB: A Database of Non-Native English Accents to Assist Neural Speech Recognition [3.028098724882708]
We first spell out the key requirements for creating a well-curated database of speech samples in non-native accents for training and testing robust ASR systems.
We then introduce AccentDB, one such database that contains samples of 4 Indian-English accents collected by us.
We present several accent classification models and evaluate them thoroughly against human-labelled accent classes.
arXiv Detail & Related papers (2020-05-16T12:38:30Z)