Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
- URL: http://arxiv.org/abs/2308.14909v1
- Date: Mon, 28 Aug 2023 21:25:05 GMT
- Title: Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
- Authors: Hyungchan Yoon, Changhwan Kim, Eunwoo Song, Hyun-Wook Yoon, Hong-Goo
Kang
- Abstract summary: We propose an effective pruning method for a transformer known as sparse attention, to improve the TTS model's generalization abilities.
We also propose a new differentiable pruning method that allows the model to automatically learn the thresholds.
- Score: 26.533600745910437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For personalized speech generation, a neural text-to-speech (TTS) model must
be successfully implemented with limited data from a target speaker. To this
end, the baseline TTS model needs to be amply generalized to out-of-domain data
(i.e., target speaker's speech). However, approaches to address this
out-of-domain generalization problem in TTS have yet to be thoroughly studied.
In this work, we propose an effective pruning method for a transformer known as
sparse attention, to improve the TTS model's generalization abilities. In
particular, we prune off redundant connections from self-attention layers whose
attention weights are below the threshold. To flexibly determine the pruning
strength for searching optimal degree of generalization, we also propose a new
differentiable pruning method that allows the model to automatically learn the
thresholds. Evaluations on zero-shot multi-speaker TTS verify the effectiveness
of our method in terms of voice quality and speaker similarity.
Related papers
- SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection [7.6732312922460055]
We propose SelectTTS, a novel method to select the appropriate frames from the target speaker and decode using frame-level self-supervised learning (SSL) features.
We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker text-to-speech frameworks in both objective and subjective metrics.
arXiv Detail & Related papers (2024-08-30T17:34:46Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - Continual Learning for On-Device Speech Recognition using Disentangled
Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation [21.218195769245032]
This paper proposes a parameter-efficient few-shot speaker adaptation, where the backbone model is augmented with trainable lightweight modules called residual adapters.
Experimental results show that the proposed approach can achieve competitive naturalness and speaker similarity compared to the full fine-tuning approaches.
arXiv Detail & Related papers (2022-10-28T03:33:07Z) - AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z) - Transfer Learning Framework for Low-Resource Text-to-Speech using a
Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z) - Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. In our technique, we take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed them as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z) - Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from single speech audio.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.