Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation
- URL: http://arxiv.org/abs/2210.15868v1
- Date: Fri, 28 Oct 2022 03:33:07 GMT
- Title: Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation
- Authors: Nobuyuki Morioka, Heiga Zen, Nanxin Chen, Yu Zhang, Yifan Ding
- Abstract summary: This paper proposes a parameter-efficient few-shot speaker adaptation approach in which the backbone model is augmented with trainable lightweight modules called residual adapters.
Experimental results show that the proposed approach achieves naturalness and speaker similarity competitive with full fine-tuning approaches.
- Score: 21.218195769245032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adapting a neural text-to-speech (TTS) model to a target speaker typically
involves fine-tuning most if not all of the parameters of a pretrained
multi-speaker backbone model. However, serving hundreds of fine-tuned neural
TTS models is expensive as each of them requires significant footprint and
separate computational resources (e.g., accelerators, memory). To scale speaker
adapted neural TTS voices to hundreds of speakers while preserving the
naturalness and speaker similarity, this paper proposes a parameter-efficient
few-shot speaker adaptation method, where the backbone model is augmented with
trainable lightweight modules called residual adapters. This architecture
allows the backbone model to be shared across different target speakers.
Experimental results show that the proposed approach can achieve naturalness
and speaker similarity competitive with full fine-tuning approaches,
while requiring only $\sim$0.1% of the backbone model parameters for each
speaker.
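As an illustration of the core idea, here is a minimal PyTorch sketch of a bottleneck residual adapter wrapped around a frozen backbone layer. The module names, bottleneck width, zero-initialization, and insertion point are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck adapter: x + up(relu(down(norm(x)))).

    The up-projection is zero-initialized so the adapter starts as an
    identity mapping and only gradually perturbs the frozen backbone.
    Sizes are illustrative, not taken from the paper.
    """
    def __init__(self, d_model: int = 512, d_bottleneck: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(self.norm(x))))

class AdaptedBlock(nn.Module):
    """One frozen backbone layer plus a per-speaker trainable adapter."""
    def __init__(self, backbone_layer: nn.Module, d_model: int = 512):
        super().__init__()
        self.backbone_layer = backbone_layer
        for p in self.backbone_layer.parameters():
            p.requires_grad = False  # the backbone stays shared and frozen
        self.adapter = ResidualAdapter(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.backbone_layer(x))

block = AdaptedBlock(nn.Linear(512, 512))
out = block(torch.randn(2, 50, 512))  # (batch, frames, features)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
print(out.shape, trainable)  # only ~10k adapter parameters are trainable
```

Because only the adapter weights are speaker-specific, serving a new voice means loading a few kilobytes of adapter parameters into the shared backbone, which is how the ~0.1%-per-speaker figure becomes plausible.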
Related papers
- SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection [7.6732312922460055]
We propose SelectTTS, a novel method to select the appropriate frames from the target speaker and decode using frame-level self-supervised learning (SSL) features.
We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker text-to-speech frameworks in both objective and subjective metrics.
arXiv Detail & Related papers (2024-08-30T17:34:46Z)
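The SelectTTS entry above hinges on picking target-speaker frames by matching frame-level SSL features. Below is a minimal numpy sketch of such nearest-neighbour frame selection; the function name, cosine-similarity matching, and random stand-in features are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def select_frames(predicted: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Replace each predicted SSL frame with its nearest neighbour among
    the target speaker's reference frames (cosine similarity).

    predicted:  (T, D) frame-level SSL features produced from text
    reference:  (N, D) SSL features extracted from target-speaker audio
    returns:    (T, D) frames copied verbatim from the reference
    """
    pred = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sim = pred @ ref.T                    # (T, N) cosine similarities
    return reference[sim.argmax(axis=1)]  # pick best-matching frame

# Toy usage with random arrays standing in for real SSL features:
frames = select_frames(np.random.randn(100, 768), np.random.randn(400, 768))
print(frames.shape)  # (100, 768)
```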
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech [26.533600745910437]
We propose sparse attention, an effective pruning method for the transformer, to improve the TTS model's generalization abilities.
We also propose a new differentiable pruning method that allows the model to automatically learn the thresholds.
arXiv Detail & Related papers (2023-08-28T21:25:05Z)
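The pruning entry above mentions learning pruning thresholds differentiably. One generic way to realize that idea, and not the paper's actual method, is a sigmoid-relaxed gate on the attention map with a trainable threshold:

```python
import torch
import torch.nn as nn

class ThresholdedAttention(nn.Module):
    """Attention map with a learnable soft threshold.

    Weights below the (sigmoid-sharpened) threshold are suppressed; the
    gate is differentiable in both the weights and the threshold, so the
    threshold can be learned end to end. Purely illustrative.
    """
    def __init__(self, temperature: float = 50.0):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(0.01))
        self.temperature = temperature

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        attn = torch.softmax(scores, dim=-1)
        # Soft gate: ~1 above the threshold, ~0 below.
        gate = torch.sigmoid(self.temperature * (attn - self.threshold))
        pruned = attn * gate
        return pruned / (pruned.sum(dim=-1, keepdim=True) + 1e-9)

attn = ThresholdedAttention()
out = attn(torch.randn(2, 4, 10, 10))  # (batch, heads, query, key)
print(out.shape)
```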
- ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation [18.84413550077318]
We propose the use of the "mixture of adapters" method to learn unique characteristics of different speakers.
Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests.
This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind.
arXiv Detail & Related papers (2023-05-29T11:39:01Z)
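For the ADAPTERMIX entry above, here is a rough sketch of a "mixture of adapters" layer: K small bottleneck adapters whose outputs are blended by a learned router. The mean-pooled linear router and all sizes are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class AdapterMixture(nn.Module):
    """Blends K bottleneck adapters with learned mixing weights."""
    def __init__(self, d_model: int = 512, d_bottleneck: int = 8, k: int = 4):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_bottleneck), nn.ReLU(),
                          nn.Linear(d_bottleneck, d_model))
            for _ in range(k))
        self.router = nn.Linear(d_model, k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One mixing weight per adapter, from utterance-level pooling.
        w = torch.softmax(self.router(x.mean(dim=1)), dim=-1)       # (B, K)
        delta = torch.stack([a(x) for a in self.adapters], dim=-1)  # (B, T, D, K)
        return x + (delta * w[:, None, None, :]).sum(dim=-1)

mix = AdapterMixture()
print(mix(torch.randn(2, 50, 512)).shape)  # torch.Size([2, 50, 512])
```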
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers [8.980713707011953]
Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers.
However, fine-tuning can negatively affect the quality of speech synthesis for previously learnt speakers.
We propose an alternative approach for TTS adaptation based on using parameter-efficient adapter modules.
arXiv Detail & Related papers (2022-11-01T16:59:54Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
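Meta-TTS above trains with MAML. Here is a generic second-order MAML step in PyTorch, with a toy linear model standing in for the TTS network; the per-speaker support/query split mirrors few-shot enrollment, but nothing here is Meta-TTS's actual code.

```python
import torch
import torch.nn as nn

def maml_step(model, loss_fn, support, query, inner_lr=0.01):
    """One MAML meta-update for a single speaker (task).

    The inner loop adapts a fast copy of the weights on the support set;
    the query loss is backpropagated through that adaptation (second
    order, via create_graph=True) for the outer update.
    """
    params = dict(model.named_parameters())
    x_s, y_s = support
    inner_loss = loss_fn(torch.func.functional_call(model, params, (x_s,)), y_s)
    grads = torch.autograd.grad(inner_loss, list(params.values()),
                                create_graph=True)
    fast = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
    x_q, y_q = query
    return loss_fn(torch.func.functional_call(model, fast, (x_q,)), y_q)

model = nn.Linear(8, 8)  # stand-in for a multi-speaker TTS model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
support = (torch.randn(4, 8), torch.randn(4, 8))  # few enrollment samples
query = (torch.randn(4, 8), torch.randn(4, 8))
meta_loss = maml_step(model, nn.functional.mse_loss, support, query)
opt.zero_grad(); meta_loss.backward(); opt.step()
print(float(meta_loss))
```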
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization [15.698168668305001]
We present BOFFIN TTS, a novel approach for few-shot speaker adaptation.
We show that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio.
arXiv Detail & Related papers (2020-02-04T16:37:52Z)
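BOFFIN TTS above casts few-shot adaptation as hyperparameter tuning by Bayesian optimization. The sketch below is a self-contained GP-with-expected-improvement loop over a single hyperparameter (log learning rate); the quadratic `objective` is a synthetic stand-in for the validation loss of a real fine-tuning run, and none of this is the paper's code.

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, length_scale=0.3):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """GP posterior mean/std with an RBF kernel and zero prior mean."""
    k_inv = np.linalg.solve(rbf(x_train, x_train)
                            + noise * np.eye(len(x_train)),
                            np.eye(len(x_train)))
    k_star = rbf(x_test, x_train)
    mu = k_star @ k_inv @ y_train
    var = 1.0 - np.sum((k_star @ k_inv) * k_star, axis=1)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def objective(log_lr):
    """Hypothetical stand-in: noisy validation loss of a fine-tuning run
    as a function of log10 learning rate."""
    return (log_lr + 3.0) ** 2 + 0.1 * np.random.randn()

# Bayesian optimization over log10(learning rate) in [-5, -1].
grid = np.linspace(-5, -1, 200)
xs = list(np.random.uniform(-5, -1, 3))  # random initial design
ys = [objective(x) for x in xs]
for _ in range(10):
    mu, sd = gp_posterior(np.array(xs), np.array(ys), grid)
    best = min(ys)
    z = (best - mu) / sd
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
    x_next = float(grid[np.argmax(ei)])
    xs.append(x_next); ys.append(objective(x_next))
print("best log10(lr):", xs[int(np.argmin(ys))])
```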
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.