Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech
- URL: http://arxiv.org/abs/2111.04040v1
- Date: Sun, 7 Nov 2021 09:53:31 GMT
- Title: Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech
- Authors: Sung-Feng Huang, Chyi-Jiunn Lin, Hung-yi Lee
- Abstract summary: We use Model-Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize speech with high speaker similarity from a few enrollment samples, using fewer adaptation steps than the speaker adaptation baseline.
- Score: 62.95422526044178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personalizing a speech synthesis system is a highly desired
application, where the system generates speech in the user's voice from only a
few enrolled recordings. Recent works take two main approaches to building such
a system: speaker adaptation and speaker encoding. On the one hand, speaker
adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model
with a few enrolled samples. However, they require at least thousands of
fine-tuning steps for high-quality adaptation, making them hard to deploy on
devices. On the other hand, speaker encoding methods encode enrollment
utterances into a speaker embedding, and the trained TTS model synthesizes the
user's speech conditioned on that embedding. Nevertheless, the speaker encoder
suffers from a generalization gap between seen and unseen speakers.
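Below is a minimal sketch of the speaker-encoding pipeline just described. The
modules (SpeakerEncoder, ToyAcousticModel) are hypothetical stand-ins, not any
model from the paper: an encoder pools enrollment mel-spectrogram frames into
a fixed speaker embedding, and the acoustic model is conditioned on that
embedding with no per-user fine-tuning.

```python
# Sketch of the speaker-encoding approach: enrollment audio is mapped
# to one fixed embedding, and the TTS model is conditioned on it.
# Both modules are hypothetical stand-ins, not the paper's baseline.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Stand-in encoder: projects and mean-pools mel frames."""
    def __init__(self, n_mels=80, spk_dim=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, spk_dim)

    def forward(self, mels):             # mels: (frames, n_mels)
        return self.proj(mels).mean(0)   # -> (spk_dim,) embedding

class ToyAcousticModel(nn.Module):
    """Stand-in TTS acoustic model: fuses text features with the
    speaker embedding to predict mel frames."""
    def __init__(self, text_dim=64, spk_dim=256, n_mels=80):
        super().__init__()
        self.out = nn.Linear(text_dim + spk_dim, n_mels)

    def forward(self, text_feats, spk_emb):  # text_feats: (T, text_dim)
        spk = spk_emb.expand(text_feats.size(0), -1)
        return self.out(torch.cat([text_feats, spk], dim=-1))

encoder, tts = SpeakerEncoder(), ToyAcousticModel()
enroll_mels = torch.randn(200, 80)           # a few enrollment frames
spk_emb = encoder(enroll_mels)               # one embedding per user
mel_out = tts(torch.randn(50, 64), spk_emb)  # (50, 80) predicted mels
```

Because enrollment is just a forward pass through the encoder, adaptation is
instant; the cost is that embedding quality degrades for voices unlike those
seen in training, which is exactly the generalization gap noted above.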
In this paper, we propose applying a meta-learning algorithm to the speaker
adaptation method. More specifically, we use Model-Agnostic Meta-Learning
(MAML) as the training algorithm of a multi-speaker TTS model, aiming to find
a good meta-initialization from which the model can quickly adapt to any
few-shot speaker-adaptation task. The meta-trained TTS model can therefore be
adapted to unseen speakers efficiently. Our experiments compare the proposed
method (Meta-TTS) with two baselines: a speaker adaptation baseline and a
speaker encoding baseline. The evaluation results show that Meta-TTS
synthesizes speech with high speaker similarity from a few enrollment samples,
using fewer adaptation steps than the speaker adaptation baseline, and
outperforms the speaker encoding baseline under the same training scheme. Even
when the baseline's speaker encoder is pre-trained on data from 8,371
additional speakers, Meta-TTS still outperforms the baseline on the LibriTTS
dataset and achieves comparable results on the VCTK dataset.
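As a rough illustration of this training scheme, here is a first-order MAML
(FOMAML) sketch in which a toy regression network stands in for the
multi-speaker TTS model and a hypothetical sample_speaker_task() stands in for
drawing one speaker's support/query batches; the paper's actual architecture
and any second-order terms are not reproduced.

```python
# First-order MAML (FOMAML) sketch for speaker-adaptive training.
# The tiny regression net stands in for a multi-speaker TTS model;
# sample_speaker_task() is a hypothetical placeholder that would draw
# (support, query) batches for one speaker from a corpus like LibriTTS.
import copy
import torch
import torch.nn as nn

def sample_speaker_task():
    """Return (support, query) batches for one sampled speaker.
    Random tensors stand in for real (text, mel) features here."""
    support = (torch.randn(8, 16), torch.randn(8, 4))
    query = (torch.randn(8, 16), torch.randn(8, 4))
    return support, query

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
inner_lr, inner_steps, tasks_per_batch = 1e-2, 5, 4

for meta_step in range(100):
    meta_opt.zero_grad()
    for _ in range(tasks_per_batch):
        (x_s, y_s), (x_q, y_q) = sample_speaker_task()
        # Inner loop: few-shot adaptation of a copy to this speaker.
        learner = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            loss_fn(learner(x_s), y_s).backward()
            inner_opt.step()
        # Outer loss: evaluate the adapted copy on held-out query data,
        # then (first-order) accumulate its gradients onto the meta-model.
        learner.zero_grad()
        loss_fn(learner(x_q), y_q).backward()
        for p, lp in zip(model.parameters(), learner.parameters()):
            p.grad = lp.grad.clone() if p.grad is None else p.grad + lp.grad
    meta_opt.step()  # update the meta-initialization
```

Each meta-update simulates few-shot adaptation in the inner loop and then
improves the shared initialization so that adaptation succeeds quickly, which
is why a meta-trained model needs far fewer fine-tuning steps for an unseen
speaker at enrollment time.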
Related papers
- SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection [7.6732312922460055]
We propose SelectTTS, a novel method that selects matching frames from the target speaker's speech and decodes them using frame-level self-supervised learning (SSL) features.
We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker text-to-speech frameworks in both objective and subjective metrics.
arXiv Detail & Related papers (2024-08-30T17:34:46Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with very high similarity to the target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints [36.07346889498981]
We propose GC-TTS which achieves high-quality speaker adaptation with significantly improved speaker similarity.
A TTS model is pre-trained for base speakers with a sufficient amount of data, and then fine-tuned for novel speakers on a few minutes of data with two geometric constraints.
The experimental results demonstrate that GC-TTS generates high-quality speech from only a few minutes of training data, outperforming standard techniques in terms of speaker similarity to the target speaker.
arXiv Detail & Related papers (2021-08-16T04:25:31Z)
- GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis [6.632254395574993]
GANSpeech is a high-fidelity TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model.
In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
arXiv Detail & Related papers (2021-06-29T08:15:30Z)
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), the model effectively synthesizes speech in the style of the target speaker even from a single speech sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
- Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker Adaptation and Pronunciation Enhancement [1.7704011486040843]
We show that one can transfer an existing TTS model to new speakers from the same or a different language using only 20 minutes of data.
We first introduce a base multi-lingual Tacotron with language-agnostic input, then demonstrate how transfer learning is done for different scenarios of speaker adaptation.
arXiv Detail & Related papers (2020-11-12T14:05:34Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS) synthesis.
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.