AdaSpeech: Adaptive Text to Speech for Custom Voice
- URL: http://arxiv.org/abs/2103.00993v1
- Date: Mon, 1 Mar 2021 13:28:59 GMT
- Title: AdaSpeech: Adaptive Text to Speech for Custom Voice
- Authors: Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao,
Tie-Yan Liu
- Abstract summary: We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experimental results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K parameters specific to each speaker.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Custom voice, a specific text to speech (TTS) service in commercial speech
platforms, aims to adapt a source TTS model to synthesize a personal voice for a
target speaker using only a small amount of speech data. Custom voice presents two unique
challenges for TTS adaptation: 1) to support diverse customers, the adaptation
model needs to handle diverse acoustic conditions that could be very different
from source speech data, and 2) to support a large number of customers, the
adaptation parameters need to be small enough for each target speaker to reduce
memory usage while maintaining high voice quality. In this work, we propose
AdaSpeech, an adaptive TTS system for high-quality and efficient customization
of new voices. We design several techniques in AdaSpeech to address the two
challenges in custom voice: 1) To handle different acoustic conditions, we use
two acoustic encoders to extract an utterance-level vector and a sequence of
phoneme-level vectors from the target speech during training; in inference, we
extract the utterance-level vector from a reference speech and use an acoustic
predictor to predict the phoneme-level vectors. 2) To better balance the
number of adaptation parameters against voice quality, we introduce conditional
layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune
this part, in addition to the speaker embedding, during adaptation. We pre-train
the source TTS model on the LibriTTS dataset and fine-tune it on the VCTK and
LJSpeech datasets (whose acoustic conditions differ from LibriTTS) with little
adaptation data, e.g., 20 sentences (about 1 minute of speech). Experimental
results show that AdaSpeech achieves much better adaptation quality than
baseline methods, with only about 5K parameters specific to each speaker, which demonstrates its
effectiveness for custom voice. Audio samples are available at
https://speechresearch.github.io/adaspeech/.
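For concreteness, the following is a minimal PyTorch sketch of the first technique (acoustic condition modeling). The module names, layer choices, and dimensions here are illustrative assumptions rather than the paper's exact architecture: during training, encoders extract an utterance-level vector and phoneme-level vectors from the target speech, while at inference a predictor supplies the phoneme-level vectors from the phoneme hidden states.

```python
# Illustrative sketch only: layer sizes and module structure are assumptions,
# not the exact AdaSpeech architecture.
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Pools a reference mel-spectrogram into a single utterance-level vector."""
    def __init__(self, n_mels=80, d_hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_hidden, d_hidden, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, mel):          # mel: (batch, n_mels, frames)
        h = self.conv(mel)           # (batch, d_hidden, frames)
        return h.mean(dim=-1)        # mean-pool over time -> (batch, d_hidden)

class PhonemeLevelPredictor(nn.Module):
    """Predicts phoneme-level acoustic vectors from phoneme hidden states,
    standing in at inference for the encoder that needs target speech."""
    def __init__(self, d_hidden=256, d_acoustic=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_acoustic),
        )

    def forward(self, phoneme_hidden):   # (batch, phonemes, d_hidden)
        return self.net(phoneme_hidden)  # (batch, phonemes, d_acoustic)
```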
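The second technique, conditional layer normalization, replaces the fixed scale and bias of layer normalization with values generated from the speaker embedding. Below is a minimal sketch, again with assumed dimensions; because only the two small generator layers (plus the speaker embedding) need fine-tuning, the per-speaker footprint stays small, consistent with the roughly 5K per-speaker parameters reported above.

```python
# Illustrative sketch only: dimensions are assumptions.
import torch.nn as nn
import torch.nn.functional as F

class ConditionalLayerNorm(nn.Module):
    """LayerNorm whose scale (gamma) and bias (beta) are generated from a
    speaker embedding instead of being learned as fixed parameters."""
    def __init__(self, d_model=256, d_spk=256):
        super().__init__()
        self.to_scale = nn.Linear(d_spk, d_model)  # generates gamma
        self.to_bias = nn.Linear(d_spk, d_model)   # generates beta

    def forward(self, x, spk_emb):
        # x: (batch, time, d_model); spk_emb: (batch, d_spk)
        x = F.layer_norm(x, x.shape[-1:])            # normalize, no fixed affine
        gamma = self.to_scale(spk_emb).unsqueeze(1)  # (batch, 1, d_model)
        beta = self.to_bias(spk_emb).unsqueeze(1)
        return gamma * x + beta                      # speaker-conditional affine
```

During adaptation, the rest of the decoder can stay frozen while only these generators and the speaker embedding are updated; per deployed speaker, one could even store just the generated gamma and beta vectors.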
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
- Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model [11.62674351793]
We introduce a novel audio codec-based TTS model to adapt context features with multiple enhancements.
Inspired by the success of Q-Former, we propose a multi-modal context-enhanced Q-Former.
Our proposed method outperforms baselines across various context TTS scenarios.
arXiv Detail & Related papers (2024-06-06T03:06:45Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model-Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model that synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.