Parameter-Efficient Learning for Text-to-Speech Accent Adaptation
- URL: http://arxiv.org/abs/2305.11320v1
- Date: Thu, 18 May 2023 22:02:59 GMT
- Title: Parameter-Efficient Learning for Text-to-Speech Accent Adaptation
- Authors: Li-Jen Yang, Chao-Han Huck Yang, Jen-Tzung Chien
- Abstract summary: This paper presents a parameter-efficient learning (PEL) approach to low-resource accent adaptation for text-to-speech (TTS).
A resource-efficient adaptation of a frozen pre-trained TTS model is developed using only 1.2% to 0.8% of the original trainable parameters.
Experimental results show that the proposed methods achieve competitive naturalness with parameter-efficient decoder fine-tuning.
- Score: 58.356667204518985
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents a parameter-efficient learning (PEL) approach to low-resource accent adaptation for text-to-speech (TTS). A resource-efficient adaptation of a frozen pre-trained TTS model is developed using only 1.2% to 0.8% of the original trainable parameters to achieve competitive performance in voice synthesis. Motivated by the theoretical foundation of optimal transport (OT), this study carries out PEL for TTS, where an auxiliary unsupervised loss based on OT is introduced to maximize a difference between the pre-trained source domain and the (unseen) target domain, in addition to the supervised training loss. Further, we leverage this unsupervised loss refinement to boost system performance via either the sliced Wasserstein distance or the maximum mean discrepancy. The merit of this work is demonstrated by realizing PEL solutions based on residual adapter learning and model reprogramming, evaluated on Mandarin accent adaptation. Experimental results show that the proposed methods achieve competitive naturalness with parameter-efficient decoder fine-tuning, and that the auxiliary unsupervised loss empirically improves model performance.
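As a concrete illustration of the components named in the abstract, the sketch below implements the two candidate discrepancy measures, sliced Wasserstein distance and an RBF-kernel maximum mean discrepancy, together with a generic bottleneck residual adapter. This is a minimal PyTorch sketch under assumed feature shapes, projection count, kernel bandwidth, and adapter sizes; it is not the authors' implementation, and how the auxiliary term is weighted relative to the supervised TTS loss follows the paper rather than this snippet.

```python
# Minimal sketch (not the authors' code): the two unsupervised discrepancies
# named in the abstract plus a generic bottleneck residual adapter.
# Feature shapes, projection count, kernel bandwidth, and adapter sizes are
# illustrative assumptions.
import torch


def sliced_wasserstein(x: torch.Tensor, y: torch.Tensor, n_proj: int = 64) -> torch.Tensor:
    """Average 1-D Wasserstein distance over random projections.

    x, y: (batch, dim) pooled features from the source and target domains;
    assumes equal batch sizes so sorted projections can be paired directly.
    """
    proj = torch.randn(x.size(1), n_proj, device=x.device)
    proj = proj / proj.norm(dim=0, keepdim=True)      # unit-norm directions
    x_sorted, _ = torch.sort(x @ proj, dim=0)         # sorting gives the optimal 1-D coupling
    y_sorted, _ = torch.sort(y @ proj, dim=0)
    return (x_sorted - y_sorted).abs().mean()


def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased squared maximum mean discrepancy with an RBF kernel of bandwidth sigma."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()


class ResidualAdapter(torch.nn.Module):
    """Generic bottleneck adapter added to a frozen TTS layer (sizes assumed)."""

    def __init__(self, d_model: int = 256, bottleneck: int = 32):
        super().__init__()
        self.down = torch.nn.Linear(d_model, bottleneck)
        self.up = torch.nn.Linear(bottleneck, d_model)
        self.act = torch.nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's representation intact.
        return h + self.up(self.act(self.down(h)))


if __name__ == "__main__":
    src = torch.randn(16, 256)   # stand-in source-domain features
    tgt = torch.randn(16, 256)   # stand-in target-domain features
    print(float(sliced_wasserstein(src, tgt)), float(mmd_rbf(src, tgt)))
```

In a setup like the one described, either discrepancy would be computed on pooled hidden features of the two domains and combined with the supervised TTS losses as the auxiliary term, while only the adapter (or reprogramming) parameters receive gradients.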
Related papers
- Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization [34.51491788470738]
We propose reverse inference optimization (RIO) to enhance the robustness of autoregressive-model-based text-to-speech (TTS) systems.
RIO uses reverse inference as the criterion for selecting the exemplars used in RLHF from the speech samples generated by the TTS system itself.
RIO significantly improves the stability of zero-shot TTS performance by reducing the discrepancies between training and inference conditions.
arXiv Detail & Related papers (2024-07-02T13:04:04Z) - Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate unconstrained reward score scaling during reward model training.
PCRM incorporates prior constraints, specifically the length ratio and the cosine similarity between the outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z) - Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting [10.559392015748989]
- Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting [10.559392015748989]
We show that catastrophic forgetting can be overcome by our methods without degrading the fine-tuning performance.
Our results demonstrate that the Kronecker-factored approximation preserves pre-training knowledge better than the diagonal one.
arXiv Detail & Related papers (2024-02-19T15:26:19Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Directly Attention Loss Adjusted Prioritized Experience Replay [0.07366405857677226]
- Directly Attention Loss Adjusted Prioritized Experience Replay [0.07366405857677226]
Prioritized Experience Replay (PER) enables the model to learn more from relatively important samples by artificially changing how often they are sampled.
DALAP is proposed, which can directly quantify the extent of the distribution shift through a Parallel Self-Attention network.
arXiv Detail & Related papers (2023-11-24T10:14:05Z) - Attention Loss Adjusted Prioritized Experience Replay [0.0]
Prioritized Experience Replay (PER) is a technique in deep reinforcement learning that selects more informative experience samples to improve the training efficiency of the neural network.
The non-uniform sampling used in PER inevitably shifts the state-action distribution and introduces estimation error in the Q-value function.
An Attention Loss Adjusted Prioritized (ALAP) Experience Replay algorithm is proposed, which integrates an improved Self-Attention network with a Double-Sampling mechanism.
arXiv Detail & Related papers (2023-09-13T02:49:32Z) - Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
- Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences.
We show that it consistently outperforms PPO in language tasks by a large margin.
We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z) - Semi-Supervised Learning Based on Reference Model for Low-resource TTS [32.731900584216724]
We propose a semi-supervised learning method for neural TTS in which labeled target data is limited.
Experimental results show that the proposed semi-supervised learning scheme with limited target data significantly improves voice quality on test data, achieving naturalness and robustness in speech synthesis.
arXiv Detail & Related papers (2022-10-25T07:48:07Z) - Listen, Adapt, Better WER: Source-free Single-utterance Test-time
Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z) - Unsupervised neural adaptation model based on optimal transport for
- Unsupervised neural adaptation model based on optimal transport for spoken language identification [54.96267179988487]
Due to the mismatch between the statistical distributions of acoustic speech in the training and testing sets, the performance of spoken language identification (SLID) can be drastically degraded.
We propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID.
arXiv Detail & Related papers (2020-12-24T07:37:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.