BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization
- URL: http://arxiv.org/abs/2002.01953v1
- Date: Tue, 4 Feb 2020 16:37:52 GMT
- Title: BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization
- Authors: Henry B. Moss, Vatsal Aggarwal, Nishant Prateek, Javier González,
Roberto Barra-Chicote
- Abstract summary: We present BOFFIN TTS, a novel approach for few-shot speaker adaptation.
We show that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio.
- Score: 15.698168668305001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To
Speech), a novel approach for few-shot speaker adaptation. Here, the task is to
fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus
of target utterances. We demonstrate that there does not exist a
one-size-fits-all adaptation strategy, with convincing synthesis requiring a
corpus-specific configuration of the hyper-parameters that control fine-tuning.
By using Bayesian optimization to efficiently optimize these hyper-parameter
values for a target speaker, we are able to perform adaptation with an average
30% improvement in speaker similarity over standard techniques. Results
indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new
speakers using less than ten minutes of audio, achieving the same naturalness
as produced for the speakers used to train the base model.
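The abstract's core idea, tuning corpus-specific fine-tuning hyper-parameters with Bayesian optimization, can be sketched as a minimal Gaussian-process surrogate plus an upper-confidence-bound acquisition loop. This is an illustrative sketch, not the paper's implementation: `similarity_score` is a synthetic stand-in for the real (expensive) fine-tune-and-evaluate step, and the two hyper-parameters and their ranges are assumptions chosen for the example.

```python
import numpy as np

def similarity_score(log_lr, steps):
    # Synthetic stand-in for the expensive objective: fine-tune the base
    # TTS model with these hyper-parameters, then score speaker similarity.
    # Peaks at log_lr = -3.5, steps = 1200 (arbitrary illustrative optimum).
    return -(log_lr + 3.5) ** 2 - 0.5 * (steps / 1000.0 - 1.2) ** 2

def rbf(A, B, length=0.3):
    # Squared-exponential kernel on inputs normalized to the unit square.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, Xq, noise=1e-4):
    # Standard GP regression posterior (with mean-centred targets).
    ym = y.mean()
    L = np.linalg.cholesky(rbf(X, X) + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - ym))
    Ks = rbf(X, Xq)
    mu = Ks.T @ alpha + ym
    v = np.linalg.solve(L, Ks)
    var = 1.0 - (v ** 2).sum(axis=0)  # k(x, x) = 1 for this kernel
    return mu, np.sqrt(np.clip(var, 1e-12, None))

rng = np.random.default_rng(0)
# Illustrative search space: log10 learning rate and fine-tuning step count.
bounds = np.array([[-5.0, -2.0], [100.0, 3000.0]])

def sample(n):
    return rng.uniform(bounds[:, 0], bounds[:, 1], size=(n, 2))

def normalize(Z):
    return (Z - bounds[:, 0]) / (bounds[:, 1] - bounds[:, 0])

# Seed the surrogate with a few random evaluations.
X = sample(4)
y = np.array([similarity_score(*x) for x in X])

# BO loop: evaluate the candidate maximizing an upper confidence bound.
for _ in range(15):
    cand = sample(256)
    mu, sd = gp_posterior(normalize(X), y, normalize(cand))
    nxt = cand[np.argmax(mu + 2.0 * sd)]
    X = np.vstack([X, nxt])
    y = np.append(y, similarity_score(*nxt))

best = X[np.argmax(y)]
print("best hyper-parameters:", best, "score:", y.max())
```

In the full method, the search space would also cover choices such as batch size and which layers to unfreeze, and a dedicated BO library would typically replace this hand-rolled GP; UCB is used here only to keep the sketch dependency-free.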
Related papers
- SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection [7.6732312922460055]
We propose SelectTTS, a novel method that selects appropriate frames from the target speaker's speech and decodes them using frame-level self-supervised learning (SSL) features.
We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker text-to-speech frameworks in both objective and subjective metrics.
arXiv Detail & Related papers (2024-08-30T17:34:46Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with very high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers [8.980713707011953]
Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers.
There is also a risk that fine-tuning will degrade the quality of speech synthesis for previously learnt speakers.
We propose an alternative approach for TTS adaptation based on using parameter-efficient adapter modules.
arXiv Detail & Related papers (2022-11-01T16:59:54Z)
- Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation [21.218195769245032]
This paper proposes a parameter-efficient few-shot speaker adaptation, where the backbone model is augmented with trainable lightweight modules called residual adapters.
Experimental results show that the proposed approach can achieve competitive naturalness and speaker similarity compared to the full fine-tuning approaches.
arXiv Detail & Related papers (2022-10-28T03:33:07Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker from even a single reference audio.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- AdaSpeech: Adaptive Text to Speech for Custom Voice [104.69219752194863]
We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker.
arXiv Detail & Related papers (2021-03-01T13:28:59Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.