Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New
Speakers
- URL: http://arxiv.org/abs/2211.00585v1
- Date: Tue, 1 Nov 2022 16:59:54 GMT
- Title: Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New
Speakers
- Authors: Cheng-Ping Hsieh, Subhankar Ghosh, Boris Ginsburg
- Abstract summary: Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers.
Fine-tuning can also negatively affect the quality of speech synthesis for previously learnt speakers.
We propose an alternative approach for TTS adaptation based on using parameter-efficient adapter modules.
- Score: 8.980713707011953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning is a popular method for adapting text-to-speech (TTS) models to
new speakers. However, this approach has some challenges. Fine-tuning usually
requires several hours of high-quality speech per speaker, and it can also
degrade the quality of speech synthesis for previously learnt speakers. In this
paper we propose an alternative approach to TTS adaptation based on
parameter-efficient adapter modules. In the proposed approach, a few small
adapter modules are added to the original network. The original weights are
frozen, and only the adapters are fine-tuned on speech from the new speaker.
This parameter-efficient fine-tuning approach produces a new model with a high
level of parameter sharing with the original model. Our experiments on the
LibriTTS, HiFi-TTS and VCTK datasets validate the effectiveness of the
adapter-based method through objective and subjective metrics.
Related papers
- Lightweight Zero-shot Text-to-Speech with Mixture of Adapters [36.29364245236912]
We propose a lightweight zero-shot text-to-speech (TTS) method using a mixture of adapters (MoA).
Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model (see the mixture-of-adapters sketch after this list).
Our method achieves high-quality speech synthesis with minimal additional parameters.
arXiv Detail & Related papers (2024-07-01T13:45:31Z)
- ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation [18.84413550077318]
We propose the use of the "mixture of adapters" method to learn unique characteristics of different speakers.
Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests.
This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind.
arXiv Detail & Related papers (2023-05-29T11:39:01Z)
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation [21.218195769245032]
This paper proposes a parameter-efficient few-shot speaker adaptation, where the backbone model is augmented with trainable lightweight modules called residual adapters.
Experimental results show that the proposed approach can achieve competitive naturalness and speaker similarity compared to the full fine-tuning approaches.
arXiv Detail & Related papers (2022-10-28T03:33:07Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize speech with high speaker similarity from a few enrollment samples, using fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single short audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- AdaSpeech: Adaptive Text to Speech for Custom Voice [104.69219752194863]
We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker.
arXiv Detail & Related papers (2021-03-01T13:28:59Z)
- BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization [15.698168668305001]
We present BOFFIN TTS, a novel approach for few-shot speaker adaptation.
We show that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio.
arXiv Detail & Related papers (2020-02-04T16:37:52Z)
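For the two mixture-of-adapters entries above (Lightweight Zero-shot Text-to-Speech with MoA, and ADAPTERMIX), the general idea can be sketched as several small adapters whose outputs are combined by a speaker-conditioned gate. This is a hypothetical, minimal PyTorch illustration of that general pattern, not code from either paper; the speaker-embedding gate, the number of adapters, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    """Several small bottleneck adapters combined by a speaker-conditioned gate.

    The gate weights each adapter's output from a speaker embedding, so different
    speakers can reuse different combinations of the same small set of adapters.
    """
    def __init__(self, hidden_dim: int, speaker_dim: int,
                 num_adapters: int = 4, bottleneck_dim: int = 32):
        super().__init__()
        self.adapters = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, bottleneck_dim), nn.ReLU(),
                          nn.Linear(bottleneck_dim, hidden_dim))
            for _ in range(num_adapters)
        ])
        self.gate = nn.Linear(speaker_dim, num_adapters)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        weights = torch.softmax(self.gate(speaker_emb), dim=-1)           # (batch, num_adapters)
        expert_out = torch.stack([a(x) for a in self.adapters], dim=1)    # (batch, num_adapters, time, hidden)
        mixed = (weights[:, :, None, None] * expert_out).sum(dim=1)       # (batch, time, hidden)
        return x + mixed  # residual connection around the mixture

# Example usage with toy shapes
if __name__ == "__main__":
    moa = MixtureOfAdapters(hidden_dim=256, speaker_dim=64)
    frames = torch.randn(2, 100, 256)   # decoder hidden states
    speaker = torch.randn(2, 64)        # speaker embedding
    out = moa(frames, speaker)
    print(out.shape)  # torch.Size([2, 100, 256])
```

In a TTS model, a module like this would typically sit after decoder or variance-adapter layers, which is where the papers above report inserting their MoA modules.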
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.