Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with
Disentangled Representations
- URL: http://arxiv.org/abs/2308.13007v1
- Date: Thu, 24 Aug 2023 18:13:10 GMT
- Title: Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with
Disentangled Representations
- Authors: Wenbin Wang, Yang Song, Sanjay Jha
- Abstract summary: We propose a generalizable zero-shot speaker adaptive text-to-speech and voice conversion model.
GZS-TV introduces disentangled representation learning for speaker embedding extraction and timbre transformation.
Our experiments demonstrate that GZS-TV reduces performance degradation on unseen speakers and outperforms all baseline models in multiple datasets.
- Score: 12.388567657230116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While most research into speech synthesis has focused on synthesizing
high-quality speech for in-dataset speakers, an equally essential yet unsolved
problem is synthesizing speech for unseen speakers who are out-of-dataset with
limited reference data, i.e., speaker adaptive speech synthesis. Many studies
have proposed zero-shot speaker adaptive text-to-speech and voice conversion
approaches aimed at this task. However, most current approaches suffer from the
degradation of naturalness and speaker similarity when synthesizing speech for
unseen speakers (i.e., speakers not in the training dataset) due to the poor
generalizability of the model in out-of-distribution data. To address this
problem, we propose GZS-TV, a generalizable zero-shot speaker adaptive
text-to-speech and voice conversion model. GZS-TV introduces disentangled
representation learning for both speaker embedding extraction and timbre
transformation to improve model generalization and leverages the representation
learning capability of the variational autoencoder to enhance the speaker
encoder. Our experiments demonstrate that GZS-TV reduces performance
degradation on unseen speakers and outperforms all baseline models in multiple
datasets.
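The abstract describes the key mechanism only at a high level. As a rough, hypothetical illustration of one piece of it, the PyTorch-style sketch below shows a VAE-flavored speaker encoder whose latent mean serves as the speaker embedding and which is trained with the standard reconstruction-plus-KL objective; all module names, dimensions, and losses are placeholders and not the GZS-TV implementation.
```python
# Minimal, hypothetical sketch of a VAE-style speaker encoder: the latent mean
# serves as the speaker embedding, trained with reconstruction + KL terms.
# Names and dimensions are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAESpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, emb_dim=192):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.to_mean = nn.Linear(hidden, emb_dim)      # speaker embedding
        self.to_logvar = nn.Linear(hidden, emb_dim)
        self.decoder = nn.Sequential(                  # reconstructs a mel summary
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, mels):                           # mels: (batch, frames, n_mels)
        h = self.frame_net(mels).mean(dim=1)           # pool over time
        mu, logvar = self.to_mean(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.decoder(z)                        # reconstruct time-averaged mel
        recon_loss = F.mse_loss(recon, mels.mean(dim=1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return mu, recon_loss + kl                     # embedding, VAE loss
```
The paper applies disentangled representation learning to both speaker embedding extraction and timbre transformation; the sketch covers only the VAE-based embedding-extraction skeleton.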
Related papers
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z) - USAT: A Universal Speaker-Adaptive Text-to-Speech Approach [11.022840133207788]
The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers remains significant and unresolved.
Zero-shot approaches suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents.
Few-shot methods can reproduce highly varying accents but bring a significant storage burden and a risk of overfitting and catastrophic forgetting.
Our proposed framework unifies both zero-shot and few-shot speaker adaptation strategies, which we term "instant" and "fine-grained" adaptation based on their merits.
arXiv Detail & Related papers (2024-04-28T06:50:55Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to the target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model speaker characteristics systematically to improve generalization to new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z) - Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model-Agnostic Meta-Learning (MAML) as the training algorithm for a multi-speaker TTS model (a minimal MAML sketch appears after this list).
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z) - GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech
Synthesis [6.632254395574993]
GANSpeech is a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive TTS architecture (a generic adversarial-training sketch appears after this list).
In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
arXiv Detail & Related papers (2021-06-29T08:15:30Z) - Continual Speaker Adaptation for Text-to-Speech Synthesis [2.3224617218247126]
In this paper, we look at TTS modeling from a continual learning perspective.
The goal is to add new speakers without forgetting previous speakers.
We exploit two well-known continual learning techniques, namely experience replay and weight regularization (a sketch of both appears after this list).
arXiv Detail & Related papers (2021-03-26T15:14:20Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z) - Noise Robust TTS for Low Resource Speakers using Pre-trained Model and
Speech Enhancement [31.33429812278942]
The proposed end-to-end speech synthesis model uses both a speaker embedding and a noise representation as conditional inputs to model speaker and noise information, respectively.
Experimental results show that speech generated by the proposed approach achieves better subjective evaluation results than directly fine-tuning a multi-speaker speech synthesis model.
arXiv Detail & Related papers (2020-05-26T06:14:06Z) - Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis
Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
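For the Meta-TTS entry above, the following is a minimal first-order MAML-style meta-training step, included only to illustrate the meta-learning recipe the summary names; `tts_loss`, the task sampler, and the hyperparameters are hypothetical placeholders rather than the authors' code.
```python
# Hypothetical first-order MAML (FOMAML) step for a speaker-adaptive TTS model.
# `model` is any nn.Module, `tts_loss(model, batch)` returns a scalar loss, and
# `speaker_tasks` yields (support, query) batches, one task per speaker.
import copy
import torch

def fomaml_step(model, meta_opt, speaker_tasks, tts_loss, inner_lr=1e-3, inner_steps=3):
    meta_opt.zero_grad()
    for support_batch, query_batch in speaker_tasks:
        learner = copy.deepcopy(model)                       # per-speaker fast weights
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                         # adapt on the support set
            inner_opt.zero_grad()
            tts_loss(learner, support_batch).backward()
            inner_opt.step()
        learner.zero_grad()
        tts_loss(learner, query_batch).backward()            # evaluate the adapted weights
        # first-order approximation: treat the fast-weight gradients as the
        # meta-gradient and accumulate them onto the shared initialization
        for p, q in zip(model.parameters(), learner.parameters()):
            if q.grad is not None:
                p.grad = q.grad.clone() if p.grad is None else p.grad + q.grad
    meta_opt.step()                                          # update the initialization
```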
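For the GANSpeech entry, the sketch below shows a generic least-squares adversarial objective added on top of a feed-forward TTS reconstruction loss; the discriminator architecture, loss weighting, and any auxiliary terms used in the paper are not reproduced, and all names are placeholders.
```python
# Generic adversarial training for a non-autoregressive TTS acoustic model:
# `disc` scores mel-spectrograms, `tts` maps text batches to mel-spectrograms.
import torch
import torch.nn.functional as F

def discriminator_step(disc, d_opt, real_mel, fake_mel):
    d_opt.zero_grad()
    real_score = disc(real_mel)
    fake_score = disc(fake_mel.detach())                     # do not update the TTS model here
    d_loss = (F.mse_loss(real_score, torch.ones_like(real_score)) +
              F.mse_loss(fake_score, torch.zeros_like(fake_score)))
    d_loss.backward()
    d_opt.step()

def generator_step(tts, disc, g_opt, text_batch, real_mel):
    g_opt.zero_grad()
    fake_mel = tts(text_batch)                               # one-shot, non-autoregressive decode
    recon_loss = F.l1_loss(fake_mel, real_mel)               # usual feed-forward TTS loss
    fake_score = disc(fake_mel)
    adv_loss = F.mse_loss(fake_score, torch.ones_like(fake_score))  # fool the discriminator
    (recon_loss + adv_loss).backward()
    g_opt.step()
```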
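For the continual speaker adaptation entry, the sketch combines the two named ingredients in placeholder form: a replay buffer of batches from previously learned speakers, and a simple L2 penalty toward a snapshot of the weights taken before adapting to the new speaker, standing in for whatever weight regularizer the paper actually uses.
```python
# Hypothetical continual-learning sketch: experience replay + weight regularization.
# `tts_loss(model, batch)` and `snapshot_params` (pre-adaptation weights) are placeholders.
import random

class ReplayBuffer:
    def __init__(self, capacity=1000):
        self.capacity, self.items = capacity, []

    def add(self, batch):
        self.items.append(batch)
        if len(self.items) > self.capacity:                  # evict a random old batch when full
            self.items.pop(random.randrange(len(self.items)))

    def sample(self):
        return random.choice(self.items)

def adaptation_loss(model, snapshot_params, new_batch, buffer, tts_loss, reg_weight=1e-3):
    loss = tts_loss(model, new_batch)                        # learn the new speaker
    if buffer.items:
        loss = loss + tts_loss(model, buffer.sample())       # experience replay
    reg = sum(((p - p0) ** 2).sum()                          # weight regularization
              for p, p0 in zip(model.parameters(), snapshot_params))
    return loss + reg_weight * reg
```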