Related papers: ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation

ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation

URL: http://arxiv.org/abs/2305.18028v1
Date: Mon, 29 May 2023 11:39:01 GMT
Title: ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation
Authors: Ambuj Mehrish, Abhinav Ramesh Kashyap, Li Yingting, Navonil Majumder, Soujanya Poria
Abstract summary: We propose the use of the "mixture of adapters" method to learn unique characteristics of different speakers. Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests. This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind.
Score: 18.84413550077318
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: There are significant challenges for speaker adaptation in text-to-speech for languages that are not widely spoken or for speakers with accents or dialects that are not well-represented in the training data. To address this issue, we propose the use of the "mixture of adapters" method. This approach involves adding multiple adapters within a backbone-model layer to learn the unique characteristics of different speakers. Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests when using only one minute of data for each new speaker. Moreover, following the adapter paradigm, we fine-tune only the adapter parameters (11% of the total model parameters). This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind. Overall, our proposed approach offers a promising solution to the speech synthesis techniques, particularly for adapting to speakers from diverse backgrounds.

Related papers

MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z)
Lightweight Zero-shot Text-to-Speech with Mixture of Adapters [36.29364245236912]
We propose a lightweight zero-shot text-to-speech (TTS) method using a mixture of adapters (MoA) Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. Our method achieves high-quality speech synthesis with minimal additional parameters.
arXiv Detail & Related papers (2024-07-01T13:45:31Z)
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR) We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters. The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers [8.980713707011953]
Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers. There is also that fine-tuning will negatively affect the quality of speech synthesis for previously learnt speakers. We propose an alternative approach for TTS adaptation based on using parameter-efficient adapter modules.
arXiv Detail & Related papers (2022-11-01T16:59:54Z)
Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation [21.218195769245032]
This paper proposes a parameter-efficient few-shot speaker adaptation, where the backbone model is augmented with trainable lightweight modules called residual adapters. Experimental results show that the proposed approach can achieve competitive naturalness and speaker similarity compared to the full fine-tuning approaches.
arXiv Detail & Related papers (2022-10-28T03:33:07Z)
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis. We model the speaker characteristics systematically to improve the generalization on new speakers. Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model. We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
A Unified Speaker Adaptation Approach for ASR [37.76683818356052]
We propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation. For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers. For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture.
arXiv Detail & Related papers (2021-10-16T10:48:52Z)
Efficient Test Time Adapter Ensembling for Low-resource Language Varieties [115.12997212870962]
Specialized language and task adapters have been proposed to facilitate cross-lingual transfer of multilingual pretrained models. An intuitive solution is to use a related language adapter for the new language variety, but we observe that this solution can lead to sub-optimal performance. In this paper, we aim to improve the robustness of language adapters to uncovered languages without training new adapters.
arXiv Detail & Related papers (2021-09-10T13:44:46Z)
Exploiting Adapters for Cross-lingual Low-resource Speech Recognition [52.40623653290499]
Cross-lingual speech adaptation aims to solve the problem of leveraging multiple rich-resource languages to build models for a low-resource target language. We propose adapters to investigate the performance of multiple adapters for parameter-efficient cross-lingual speech adaptation.
arXiv Detail & Related papers (2021-05-18T08:30:37Z)
Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences. Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness. This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization [15.698168668305001]
We present BOFFIN TTS, a novel approach for few-shot speaker adaptation. We show that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio.
arXiv Detail & Related papers (2020-02-04T16:37:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.