Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
- URL: http://arxiv.org/abs/2407.01291v1
- Date: Mon, 1 Jul 2024 13:45:31 GMT
- Title: Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
- Authors: Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima
- Abstract summary: We propose a lightweight zero-shot text-to-speech (TTS) method using a mixture of adapters (MoA).
Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model.
Our method achieves high-quality speech synthesis with minimal additional parameters.
- Score: 36.29364245236912
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Advances in zero-shot text-to-speech (TTS) methods based on large-scale models have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules enhance the ability to adapt to a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings. Our method achieves high-quality speech synthesis with minimal additional parameters. Through objective and subjective evaluations, we confirmed that our method achieves better performance than the baseline with less than 40% of the parameters and 1.9 times faster inference. Audio samples are available on our demo page (https://ntt-hilab-gensp.github.io/is2024lightweightTTS/).
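As a rough illustration of the idea (not the authors' implementation), the sketch below shows how a mixture-of-adapters layer might combine small bottleneck adapters using a gate driven by the speaker embedding. All module names, dimensions, and the dense softmax gating are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckAdapter(nn.Module):
    """Small adapter: down-project, nonlinearity, up-project."""
    def __init__(self, d_model: int, d_bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.relu(self.down(x)))

class MixtureOfAdapters(nn.Module):
    """Pool of adapters mixed by a gate conditioned on the speaker embedding."""
    def __init__(self, d_model: int, d_bottleneck: int, d_speaker: int, n_adapters: int):
        super().__init__()
        self.adapters = nn.ModuleList(
            [BottleneckAdapter(d_model, d_bottleneck) for _ in range(n_adapters)]
        )
        self.gate = nn.Linear(d_speaker, n_adapters)

    def forward(self, x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); spk_emb: (batch, d_speaker)
        weights = F.softmax(self.gate(spk_emb), dim=-1)            # (batch, n_adapters)
        outs = torch.stack([a(x) for a in self.adapters], dim=-1)  # (batch, time, d_model, n)
        mixed = (outs * weights[:, None, None, :]).sum(dim=-1)     # weighted sum of adapters
        return x + mixed                                           # residual connection
```

In a FastSpeech 2-style backbone, such layers could be inserted into the decoder and the variance adapter, with only the adapters and gates trained while the backbone stays fixed.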
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation [18.84413550077318]
We propose the use of the "mixture of adapters" method to learn unique characteristics of different speakers.
Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests.
This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind.
arXiv Detail & Related papers (2023-05-29T11:39:01Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voices, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers [8.980713707011953]
Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers.
There is also a risk that fine-tuning will degrade the quality of speech synthesis for previously learned speakers.
We propose an alternative approach for TTS adaptation based on using parameter-efficient adapter modules.
arXiv Detail & Related papers (2022-11-01T16:59:54Z)
- Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation [21.218195769245032]
This paper proposes parameter-efficient few-shot speaker adaptation, in which the backbone model is augmented with trainable lightweight modules called residual adapters (see the sketch below).
Experimental results show that the proposed approach can achieve competitive naturalness and speaker similarity compared to the full fine-tuning approaches.
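A minimal sketch of the usual residual-adapter recipe, assuming a bottleneck design: the backbone stays frozen and only the small residual branches train. The helper name and bottleneck width are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck branch added to a frozen backbone layer; only this branch trains."""
    def __init__(self, d_model: int, d_bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))  # residual around the bottleneck

def prepare_for_adaptation(backbone: nn.Module, adapters: nn.ModuleList):
    """Freeze every backbone weight; few-shot data then updates adapters only."""
    for p in backbone.parameters():
        p.requires_grad = False
    return [p for a in adapters for p in a.parameters()]  # hand these to the optimizer
```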
arXiv Detail & Related papers (2022-10-28T03:33:07Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
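For readers unfamiliar with MAML, the sketch below shows a generic first-order variant applied to per-speaker tasks. The task structure and `loss_fn` are hypothetical placeholders standing in for the paper's actual training setup.

```python
import copy
import torch

def fomaml_step(model, loss_fn, tasks, inner_lr=1e-3, inner_steps=1):
    """One first-order MAML meta-gradient over a batch of speaker "tasks".

    Each task is a (support_batch, query_batch) pair for one speaker, and
    loss_fn(model, batch) returns a scalar TTS loss; both are placeholders.
    First-order for brevity: no gradients flow through the inner loop.
    """
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support, query in tasks:
        learner = copy.deepcopy(model)                  # task-specific "fast" weights
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                    # adapt on the enrollment data
            inner_opt.zero_grad()
            loss_fn(learner, support).backward()
            inner_opt.step()
        query_loss = loss_fn(learner, query)            # evaluate the adapted weights
        grads = torch.autograd.grad(query_loss, list(learner.parameters()))
        for m, g in zip(meta_grads, grads):
            m.add_(g / len(tasks))                      # average across tasks
    for p, m in zip(model.parameters(), meta_grads):
        p.grad = m                                      # meta-optimizer steps outside
```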
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis [6.632254395574993]
GANSpeech is a high-fidelity TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS backbone.
In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
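To make the adversarial-training idea concrete, here is a least-squares GAN loss for a spectrogram discriminator, a common choice in TTS. GANSpeech's exact loss variant and discriminator architecture may differ; this is only an illustration.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(disc, mel_real, mel_fake):
    """LSGAN losses: `disc` maps a mel-spectrogram to realness scores."""
    real_score = disc(mel_real)
    fake_score = disc(mel_fake.detach())     # detach: no generator gradients here
    d_loss = F.mse_loss(real_score, torch.ones_like(real_score)) + \
             F.mse_loss(fake_score, torch.zeros_like(fake_score))
    gen_score = disc(mel_fake)               # generator pass keeps the graph
    g_loss = F.mse_loss(gen_score, torch.ones_like(gen_score))
    return d_loss, g_loss
```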
arXiv Detail & Related papers (2021-06-29T08:15:30Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model that synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN, sketched below), our model effectively synthesizes speech in the style of the target speaker even from a single reference audio.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
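A minimal sketch of a style-adaptive layer norm in the spirit of SALN: the gain and bias of the normalization are predicted from a style vector rather than being fixed parameters. Dimensions and naming are assumptions.

```python
import torch
import torch.nn as nn

class SALN(nn.Module):
    """Layer norm whose gain and bias come from a style embedding."""
    def __init__(self, d_model: int, d_style: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.affine = nn.Linear(d_style, 2 * d_model)  # predicts (gamma, beta)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); style: (batch, d_style)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)
```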
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization [15.698168668305001]
We present BOFFIN TTS, a novel approach for few-shot speaker adaptation.
We show that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio.
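BOFFIN frames speaker adaptation as tuning fine-tuning hyperparameters with Bayesian optimization. The sketch below illustrates that framing with scikit-optimize; the search space, budget, and the toy `fine_tune_and_score` objective are all placeholders, not the paper's setup.

```python
from skopt import gp_minimize          # pip install scikit-optimize
from skopt.space import Real, Integer

def fine_tune_and_score(lr: float, steps: int) -> float:
    """Placeholder: fine-tune the TTS model on the new speaker's few minutes of
    audio with these hyperparameters and return a validation loss."""
    return (lr - 1e-3) ** 2 + abs(steps - 1500) * 1e-6  # toy surface for illustration

result = gp_minimize(
    lambda p: fine_tune_and_score(lr=p[0], steps=int(p[1])),
    dimensions=[Real(1e-5, 1e-2, prior="log-uniform"),  # adaptation learning rate
                Integer(100, 5000)],                    # number of fine-tuning steps
    n_calls=20,                                         # each call is one fine-tuning run
    random_state=0,
)
print("best hyperparameters:", result.x, "best validation loss:", result.fun)
```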
arXiv Detail & Related papers (2020-02-04T16:37:52Z)