AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
- URL: http://arxiv.org/abs/2204.00436v1
- Date: Fri, 1 Apr 2022 13:47:44 GMT
- Title: AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
- Authors: Yihan Wu, Xu Tan, Bohan Li, Lei He, Sheng Zhao, Ruihua Song, Tao Qin,
Tie-Yan Liu
- Abstract summary: We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
- Score: 143.47967241972995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive text to speech (TTS) can synthesize new voices in zero-shot
scenarios efficiently by using a well-trained source TTS model without
adapting it to the speech data of new speakers. Because seen and unseen
speakers have diverse characteristics, zero-shot adaptive TTS requires strong
generalization ability on speaker characteristics, which brings modeling
challenges. In this paper, we develop AdaSpeech 4, a zero-shot adaptive TTS
system for high-quality speech synthesis. We model the speaker characteristics
systematically to improve the generalization on new speakers. Generally, the
modeling of speaker characteristics can be categorized into three steps:
extracting speaker representation, taking this speaker representation as
condition, and synthesizing speech/mel-spectrogram given this speaker
representation. Accordingly, we improve the modeling in three steps: 1) To
extract speaker representation with better generalization, we factorize the
speaker characteristics into basis vectors and extract speaker representation
by a weighted combination of these basis vectors through attention. 2) We
leverage conditional layer normalization to integrate the extracted speaker
representation into the TTS model. 3) We propose a novel supervision loss based on
the distribution of basis vectors to maintain the corresponding speaker
characteristics in generated mel-spectrograms. Without any fine-tuning,
AdaSpeech 4 achieves better voice quality and similarity than baselines in
multiple datasets.
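For readers who want a concrete picture of step 1, the following is a minimal PyTorch sketch of extracting a speaker representation as an attention-weighted combination of learnable basis vectors; the class name, parameter names, and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class BasisSpeakerEncoder(nn.Module):
    """Sketch of step 1: factorize speaker characteristics into learnable basis
    vectors and form the speaker representation as their attention-weighted
    combination (module and parameter names are hypothetical)."""

    def __init__(self, num_bases: int = 10, dim: int = 256):
        super().__init__()
        # Learnable basis vectors spanning the space of speaker characteristics.
        self.bases = nn.Parameter(torch.randn(num_bases, dim))
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, ref_frames: torch.Tensor):
        # ref_frames: (batch, time, dim) encoding of the reference speech.
        query = self.query_proj(ref_frames.mean(dim=1))               # (batch, dim)
        scores = query @ self.bases.t() / self.bases.size(-1) ** 0.5  # (batch, num_bases)
        weights = torch.softmax(scores, dim=-1)                       # attention over bases
        speaker_emb = weights @ self.bases                            # (batch, dim)
        return speaker_emb, weights
```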
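Step 2's conditional layer normalization, which injects the speaker representation by predicting the layer-norm scale and bias from it, can be sketched as follows; names and dimensions are again assumptions for illustration.

```python
import torch
import torch.nn as nn


class ConditionalLayerNorm(nn.Module):
    """Sketch of step 2: layer normalization whose affine scale and bias are
    predicted from the speaker representation (hypothetical names)."""

    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        # Layer norm without its own affine parameters; the affine transform
        # is supplied by the speaker condition instead.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor):
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        scale = self.to_scale(speaker_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        bias = self.to_bias(speaker_emb).unsqueeze(1)
        return self.norm(x) * scale + bias
```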
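The abstract does not specify the exact form of the step-3 supervision loss. One plausible reading, shown purely as an assumption, is to recompute the attention weights over the basis vectors from the generated mel-spectrogram and penalize their divergence from the weights obtained from the reference speech.

```python
import torch
import torch.nn.functional as F


def basis_distribution_loss(ref_weights: torch.Tensor,
                            gen_weights: torch.Tensor,
                            eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical reading of step 3: keep the basis-vector distribution
    computed from the generated mel-spectrogram close to that of the reference
    speech, here via a KL divergence between the two attention distributions."""
    ref = ref_weights.clamp_min(eps)   # (batch, num_bases) reference distribution
    gen = gen_weights.clamp_min(eps)   # (batch, num_bases) generated distribution
    return F.kl_div(gen.log(), ref, reduction="batchmean")
```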
Related papers
- SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection [7.6732312922460055]
We propose SelectTTS, a novel method to select the appropriate frames from the target speaker and decode using frame-level self-supervised learning (SSL) features.
We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker text-to-speech frameworks in both objective and subjective metrics.
arXiv Detail & Related papers (2024-08-30T17:34:46Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations [12.388567657230116]
We propose a generalizable zero-shot speaker adaptive text-to-speech and voice conversion model.
GZS-TV introduces disentangled representation learning for speaker embedding extraction and timbre transformation.
Our experiments demonstrate that GZS-TV reduces performance degradation on unseen speakers and outperforms all baseline models in multiple datasets.
arXiv Detail & Related papers (2023-08-24T18:13:10Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints [36.07346889498981]
We propose GC-TTS which achieves high-quality speaker adaptation with significantly improved speaker similarity.
A TTS model is pre-trained for base speakers with a sufficient amount of data, and then fine-tuned for novel speakers on a few minutes of data with two geometric constraints.
The experimental results demonstrate that GC-TTS generates high-quality speech from only a few minutes of training data, outperforming standard techniques in terms of speaker similarity to the target speaker.
arXiv Detail & Related papers (2021-08-16T04:25:31Z)
- GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis [6.632254395574993]
GANSpeech is a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model.
In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
arXiv Detail & Related papers (2021-06-29T08:15:30Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.