ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for
Low-Resource TTS Adaptation
- URL: http://arxiv.org/abs/2305.18028v1
- Date: Mon, 29 May 2023 11:39:01 GMT
- Title: ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for
Low-Resource TTS Adaptation
- Authors: Ambuj Mehrish, Abhinav Ramesh Kashyap, Li Yingting, Navonil Majumder,
Soujanya Poria
- Abstract summary: We propose the use of the "mixture of adapters" method to learn unique characteristics of different speakers.
Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests.
This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind.
- Score: 18.84413550077318
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: There are significant challenges for speaker adaptation in text-to-speech for
languages that are not widely spoken or for speakers with accents or dialects
that are not well-represented in the training data. To address this issue, we
propose the use of the "mixture of adapters" method. This approach involves
adding multiple adapters within a backbone-model layer to learn the unique
characteristics of different speakers. Our approach outperforms the baseline,
with a noticeable improvement of 5% observed in speaker preference tests when
using only one minute of data for each new speaker. Moreover, following the
adapter paradigm, we fine-tune only the adapter parameters (11% of the total
model parameters). This is a significant achievement in parameter-efficient
speaker adaptation, and one of the first models of its kind. Overall, our
proposed approach offers a promising direction for speech synthesis,
particularly for adapting to speakers from diverse backgrounds.
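The abstract describes placing several adapters inside a backbone layer and training only the adapter parameters. A minimal sketch of that idea follows, in plain Python. The abstract does not say how the adapters are combined, so the softmax router used here is an assumption, as are the bottleneck design, the zero-initialised up-projection, and all names and dimensions (`Adapter`, `MixtureOfAdapters`, `trainable_fraction`, `d_model=8`, and so on). It is an illustration, not the authors' implementation.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class Adapter:
    """Bottleneck adapter: down-project, ReLU, up-project (assumed design)."""

    def __init__(self, d_model, d_bottleneck, rng):
        self.down = [[rng.gauss(0, 0.02) for _ in range(d_model)]
                     for _ in range(d_bottleneck)]
        # Zero-initialised up-projection: an untrained adapter outputs zeros.
        self.up = [[0.0] * d_bottleneck for _ in range(d_model)]

    def __call__(self, h):
        z = [max(0.0, v) for v in matvec(self.down, h)]
        return matvec(self.up, z)

    def num_params(self):
        return (len(self.down) * len(self.down[0])
                + len(self.up) * len(self.up[0]))


class MixtureOfAdapters:
    """Several adapters in one layer, combined by a softmax router (assumption)."""

    def __init__(self, d_model, d_bottleneck, n_adapters, seed=0):
        rng = random.Random(seed)
        self.adapters = [Adapter(d_model, d_bottleneck, rng)
                         for _ in range(n_adapters)]
        self.router = [[rng.gauss(0, 0.02) for _ in range(d_model)]
                       for _ in range(n_adapters)]

    def gate(self, h):
        """Per-adapter mixing weights for hidden vector h; they sum to 1."""
        return softmax(matvec(self.router, h))

    def __call__(self, h):
        weights = self.gate(h)
        outs = [adapter(h) for adapter in self.adapters]
        mixed = [sum(w * o[i] for w, o in zip(weights, outs))
                 for i in range(len(h))]
        # Residual connection: the frozen backbone output h passes through.
        return [x + m for x, m in zip(h, mixed)]

    def num_params(self):
        router = len(self.router) * len(self.router[0])
        return sum(a.num_params() for a in self.adapters) + router


def trainable_fraction(adapter_params, backbone_params):
    """Share of all parameters that are updated when the backbone is frozen."""
    return adapter_params / (adapter_params + backbone_params)


# Hypothetical sizes, chosen only to keep the example small.
moa = MixtureOfAdapters(d_model=8, d_bottleneck=2, n_adapters=4)
h = [0.1 * i for i in range(8)]
out = moa(h)
print(moa.num_params(), trainable_fraction(11, 89))
```

Because the up-projections start at zero, the untrained module returns its input unchanged, so inserting it does not disturb the pretrained backbone before adaptation begins. `trainable_fraction(11, 89)` only illustrates how a figure such as the reported 11% would be computed; the parameter counts are placeholders, not the paper's.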
Related papers
- Lightweight Zero-shot Text-to-Speech with Mixture of Adapters [36.29364245236912]
We propose a lightweight zero-shot text-to-speech (TTS) method using a mixture of adapters (MoA).
Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model.
Our method achieves high-quality speech synthesis with minimal additional parameters.
arXiv Detail & Related papers (2024-07-01T13:45:31Z)
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
- Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers [8.980713707011953]
Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers.
There is also a risk that fine-tuning will negatively affect the quality of speech synthesis for previously learnt speakers.
We propose an alternative approach for TTS adaptation based on using parameter-efficient adapter modules.
arXiv Detail & Related papers (2022-11-01T16:59:54Z)
- Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation [21.218195769245032]
This paper proposes a parameter-efficient few-shot speaker adaptation, where the backbone model is augmented with trainable lightweight modules called residual adapters.
Experimental results show that the proposed approach can achieve competitive naturalness and speaker similarity compared to the full fine-tuning approaches.
arXiv Detail & Related papers (2022-10-28T03:33:07Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- A Unified Speaker Adaptation Approach for ASR [37.76683818356052]
We propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation.
For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers.
For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture.
arXiv Detail & Related papers (2021-10-16T10:48:52Z)
- Efficient Test Time Adapter Ensembling for Low-resource Language Varieties [115.12997212870962]
Specialized language and task adapters have been proposed to facilitate cross-lingual transfer of multilingual pretrained models.
An intuitive solution is to use a related language adapter for the new language variety, but we observe that this solution can lead to sub-optimal performance.
In this paper, we aim to improve the robustness of language adapters to uncovered languages without training new adapters.
arXiv Detail & Related papers (2021-09-10T13:44:46Z)
- Exploiting Adapters for Cross-lingual Low-resource Speech Recognition [52.40623653290499]
Cross-lingual speech adaptation aims to solve the problem of leveraging multiple rich-resource languages to build models for a low-resource target language.
We use adapters and investigate the performance of multiple adapters for parameter-efficient cross-lingual speech adaptation.
arXiv Detail & Related papers (2021-05-18T08:30:37Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization [15.698168668305001]
We present BOFFIN TTS, a novel approach for few-shot speaker adaptation.
We show that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio.
arXiv Detail & Related papers (2020-02-04T16:37:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.