Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker
Adaptation and Pronunciation Enhancement
- URL: http://arxiv.org/abs/2011.06392v2
- Date: Thu, 31 Mar 2022 15:49:56 GMT
- Title: Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker
Adaptation and Pronunciation Enhancement
- Authors: Hamed Hemati, Damian Borth
- Abstract summary: We show that one can transfer an existing TTS model for new speakers from the same or a different language using only 20 minutes of data.
We first introduce a base multi-lingual Tacotron with language-agnostic input, then demonstrate how transfer learning is done for different scenarios of speaker adaptation.
- Score: 1.7704011486040843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent neural Text-to-Speech (TTS) models have been shown to perform very
well when enough data is available. However, fine-tuning them for new speakers
or languages is not straightforward in a low-resource setup. In this paper, we
show that by applying minor modifications to a Tacotron model, one can transfer
an existing TTS model for new speakers from the same or a different language
using only 20 minutes of data. For this purpose, we first introduce a base
multi-lingual Tacotron with language-agnostic input, then demonstrate how
transfer learning is done for different scenarios of speaker adaptation without
exploiting any pre-trained speaker encoder or code-switching technique. We
evaluate the transferred model in both subjective and objective ways.
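The paper's key modification is this language-agnostic input: graphemes from any language are mapped into a shared IPA symbol set before reaching the encoder, so one symbol table serves all languages. Below is a minimal sketch of such a frontend, assuming the phonemizer library with an espeak backend; the paper's exact grapheme-to-IPA tooling is not specified in the abstract.

```python
# Sketch: language-agnostic IPA input for a multilingual Tacotron.
# Assumes the `phonemizer` package with an espeak backend installed;
# the paper's actual text frontend may differ.
from phonemizer import phonemize

texts = {"en-us": "Speech synthesis is fun.",
         "de": "Sprachsynthese macht Spass."}

for lang, text in texts.items():
    ipa = phonemize(text, language=lang, backend="espeak", strip=True)
    print(lang, "->", ipa)

# Both languages now share one IPA symbol inventory, so a single
# character-level embedding table can feed the Tacotron encoder.
```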
Related papers
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
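Since the finding above hinges on phonetic similarity between the pre-training and target languages, a deliberately crude proxy for that notion is sketched below: Jaccard overlap of phoneme inventories. The inventories and the metric are illustrative assumptions, not the paper's analysis.

```python
# Toy proxy for cross-lingual phonetic similarity: Jaccard overlap of
# two phoneme inventories. Illustrative only; inventories are made up.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

pretrain_inventory = {"p", "t", "k", "s", "m", "n", "a", "i", "u"}
target_inventory = {"p", "t", "k", "s", "m", "ŋ", "a", "e", "o"}

print(f"phoneme overlap: {jaccard(pretrain_inventory, target_inventory):.2f}")
```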
- Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning [6.544954579068865]
We propose a transfer learning approach using high-resource language data and synthetically generated data.
We employ a three-step approach to train a high-quality single-speaker TTS system in a low-resource Indian language Hindi.
arXiv Detail & Related papers (2023-12-02T10:52:00Z)
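A hedged sketch of what such a three-step recipe can look like, assuming the steps are pre-training on the high-resource language, training on synthetic target-language data, and fine-tuning on real target data; the trainer and corpus names are hypothetical stand-ins, not the authors' code.

```python
# Hypothetical three-step transfer pipeline; `train` is a stub that a
# real implementation would replace with an actual TTS training loop.
def train(model, corpus, lr):
    print(f"training on {corpus} with lr={lr}")
    return model  # stub: a real loop would update the model's weights

model = object()  # placeholder for a TTS model such as Tacotron
model = train(model, "high_resource_corpus", 1e-3)       # step 1: pre-train
model = train(model, "synthetic_hindi_corpus", 1e-4)     # step 2: synthetic data
model = train(model, "real_hindi_speaker_corpus", 1e-5)  # step 3: fine-tune
```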
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
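The discrete self-supervised representations this pipeline builds on are commonly obtained by quantizing features from a model such as HuBERT with k-means. The sketch below follows that common recipe; the layer index and cluster count are arbitrary choices, not the paper's configuration.

```python
# Sketch: turn speech into discrete units via HuBERT features + k-means.
# Assumption: the standard HuBERT-unit recipe, not this paper's exact setup.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

waveform = torch.randn(1, 2 * bundle.sample_rate)  # stand-in for 2 s of speech
with torch.inference_mode():
    features, _ = hubert.extract_features(waveform)
frames = features[6].squeeze(0).numpy()  # frames from one intermediate layer

# In practice the codebook is fit on a large corpus; fitting on a single
# utterance here only keeps the sketch self-contained.
units = KMeans(n_clusters=20, n_init=10).fit_predict(frames)
print(units[:20])  # the utterance as a sequence of discrete unit IDs
```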
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
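What lets one transducer cover both ASR and ST is that the objective is identical for the two tasks; only the target token sequence (transcript vs. translation) changes. A toy illustration with torchaudio's RNN-T loss on random joint-network outputs; shapes and vocabulary size are arbitrary.

```python
# Toy RNN-T objective: the same loss trains ASR (targets = transcript
# tokens) and ST (targets = translation tokens). Values are random.
import torch
import torchaudio

B, T, U, C = 2, 50, 10, 32            # batch, frames, target length, vocab
logits = torch.randn(B, T, U + 1, C)  # joint-network outputs
targets = torch.randint(0, C - 1, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=C - 1)
print(loss)
```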
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model-Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
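MAML's structure, an inner gradient step per task followed by an outer update of the shared initialization, is easy to show on a toy problem. In the sketch below each random linear task stands in for one speaker's enrollment (support) and evaluation (query) data, and the tiny model is a placeholder for a real multi-speaker TTS network.

```python
# Minimal MAML sketch: the inner step adapts to one "speaker", the
# outer step trains the initialization. Toy regression, not a TTS model.
import torch
from torch.func import functional_call

model = torch.nn.Linear(4, 1)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.1

for step in range(100):
    w = torch.randn(4, 1)  # one toy "speaker task"
    xs, xq = torch.randn(8, 4), torch.randn(8, 4)
    ys, yq = xs @ w, xq @ w

    params = dict(model.named_parameters())
    # inner loop: one SGD step on the support (enrollment) set
    support_loss = torch.nn.functional.mse_loss(functional_call(model, params, xs), ys)
    grads = torch.autograd.grad(support_loss, list(params.values()), create_graph=True)
    adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}

    # outer loop: evaluate the adapted parameters on the query set
    query_loss = torch.nn.functional.mse_loss(functional_call(model, adapted, xq), yq)
    meta_opt.zero_grad()
    query_loss.backward()
    meta_opt.step()
```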
- Adapting TTS models For New Speakers using Transfer Learning [12.46931609726818]
Training neural text-to-speech (TTS) models for a new speaker typically requires several hours of high-quality speech data.
We propose transfer-learning guidelines for adapting high quality single-speaker TTS models for a new speaker, using only a few minutes of speech data.
arXiv Detail & Related papers (2021-10-12T07:51:25Z)
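In practice such guidelines boil down to loading a pretrained checkpoint, freezing some modules, and adapting the rest at a low learning rate on the new speaker's few minutes of data. A minimal sketch with generic placeholder modules; which components the paper actually freezes is not stated in the summary.

```python
# Sketch: adapt a pretrained single-speaker TTS model to a new speaker
# by freezing the text encoder and fine-tuning the decoder at a low LR.
# TinyTTS and the checkpoint path are hypothetical placeholders.
import torch

class TinyTTS(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = torch.nn.Embedding(100, 32)
        self.decoder = torch.nn.GRU(32, 80, batch_first=True)

model = TinyTTS()
# model.load_state_dict(torch.load("single_speaker_ckpt.pt"))  # hypothetical

for p in model.text_encoder.parameters():
    p.requires_grad = False  # keep the text frontend from the source model

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```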
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Continual Speaker Adaptation for Text-to-Speech Synthesis [2.3224617218247126]
In this paper, we look at TTS modeling from a continual learning perspective.
The goal is to add new speakers without forgetting previous speakers.
We exploit two well-known techniques for continual learning, namely experience replay and weight regularization.
arXiv Detail & Related papers (2021-03-26T15:14:20Z)
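The two techniques named above can be seen together in a few lines: experience replay mixes stored samples from earlier speakers into every update, while a penalty keeps the weights near a snapshot taken before adapting to the new speaker. The plain L2 penalty below is a simplified stand-in for regularizers like EWC; model and data are toys.

```python
# Sketch: continual speaker adaptation with experience replay plus
# weight regularization (simple L2-to-snapshot, an EWC-like stand-in).
import copy
import random
import torch

model = torch.nn.Linear(8, 8)
snapshot = copy.deepcopy(model)  # weights after the previous speakers
replay_buffer = [(torch.randn(8), torch.randn(8)) for _ in range(32)]
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 0.1  # regularization strength

for step in range(100):
    x_new, y_new = torch.randn(8), torch.randn(8)  # new-speaker sample
    x_old, y_old = random.choice(replay_buffer)    # experience replay
    loss = (torch.nn.functional.mse_loss(model(x_new), y_new)
            + torch.nn.functional.mse_loss(model(x_old), y_old))
    for p, p_old in zip(model.parameters(), snapshot.parameters()):
        loss = loss + lam * (p - p_old.detach()).pow(2).sum()  # stay close
    opt.zero_grad()
    loss.backward()
    opt.step()
```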
- One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech [3.42658286826597]
We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation.
Our model is shown to effectively share information across languages, and according to a subjective evaluation test, it produces more natural and accurate code-switching speech than the baselines.
arXiv Detail & Related papers (2020-08-03T10:43:30Z)
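Contextual parameter generation replaces separate per-language layers with one shared layer whose weights are emitted by a small generator network conditioned on a language embedding. An illustrative sketch with toy sizes, not the paper's architecture:

```python
# Sketch: a generator maps a language embedding to the weights of a
# shared linear layer, so parameters vary with the language identity.
import torch

n_langs, emb_dim, d_in, d_out = 5, 16, 32, 32
lang_emb = torch.nn.Embedding(n_langs, emb_dim)
generator = torch.nn.Linear(emb_dim, d_in * d_out + d_out)  # emits W and b

def language_conditioned_layer(x, lang_id):
    p = generator(lang_emb(lang_id))
    W = p[: d_in * d_out].view(d_out, d_in)
    b = p[d_in * d_out:]
    return torch.nn.functional.linear(x, W, b)

x = torch.randn(4, d_in)
y = language_conditioned_layer(x, torch.tensor(2))  # weights depend on language 2
print(y.shape)
```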
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
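The discrete speech representation at the core of this framework can be illustrated with a vector-quantization layer: each continuous encoder frame snaps to its nearest codebook entry, and a straight-through estimator lets gradients flow back to the encoder. The sketch below shows that generic mechanism; the paper's exact quantization scheme may differ.

```python
# Sketch: vector quantization with a straight-through estimator, the
# generic mechanism behind discrete speech representations. Toy sizes.
import torch

codebook = torch.nn.Parameter(torch.randn(64, 32))  # 64 discrete codes

def quantize(frames):                      # frames: (T, 32) encoder outputs
    dists = torch.cdist(frames, codebook)  # distance to every code
    codes = dists.argmin(dim=1)            # discrete unit sequence
    quantized = codebook[codes]
    # straight-through: forward uses codes, gradients pass to `frames`
    return frames + (quantized - frames).detach(), codes

frames = torch.randn(100, 32, requires_grad=True)
quantized, codes = quantize(frames)
print(codes[:10])
```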