Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
- URL: http://arxiv.org/abs/2501.08566v1
- Date: Wed, 15 Jan 2025 04:17:48 GMT
- Title: Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
- Authors: Qianniu Chen, Xiaoyang Hao, Bowen Li, Yue Liu, Li Lu
- Abstract summary: We present a lightweight and stable zero-shot Text-To-Speech (TTS) system.
We introduce a novel TTS architecture designed to effectively model linguistic content and speaker attributes from source speech and prompt speech, respectively.
- Score: 17.020173869112163
- Abstract: Zero-shot Text-To-Speech (TTS) synthesis shows great promise for personalized voice customization through voice cloning. However, current methods for achieving zero-shot TTS rely heavily on large model scales and extensive training datasets to ensure satisfactory performance and generalizability across various speakers. This raises concerns regarding both deployment costs and data security. In this paper, we present a lightweight and stable zero-shot TTS system. We introduce a novel TTS architecture designed to effectively model linguistic content and various speaker attributes from source speech and prompt speech, respectively. Furthermore, we present a two-stage self-distillation framework that constructs parallel data pairs for effectively disentangling linguistic content and speakers from the perspective of training data. Extensive experiments show that our system exhibits excellent performance and superior stability on zero-shot TTS tasks. Moreover, it shows markedly superior computational efficiency, with real-time factors (RTFs) of 0.13 and 0.012 on CPU and GPU, respectively.
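For reference, the real-time factor quoted above is wall-clock synthesis time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time synthesis. A minimal measurement sketch, assuming some `synthesize` callable that maps text to a waveform (the function name and sample rate are illustrative, not part of the paper):

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 22050) -> float:
    """Measure RTF: wall-clock synthesis time / duration of generated audio.

    `synthesize` stands in for any TTS system mapping text to a 1-D
    waveform array; it is an assumption, not the paper's API.
    """
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    # RTF < 1.0 means faster than real time; the paper reports
    # 0.13 (CPU) and 0.012 (GPU).
    return elapsed / audio_seconds
```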
Related papers
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
arXiv Detail & Related papers (2024-11-20T09:49:37Z) - MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech [7.038489351956803]
MultiVerse is a zero-shot multi-task TTS system that can perform TTS or speech style transfer in zero-shot and cross-lingual conditions.
We use source-filter theory-based disentanglement, utilizing the prompt for modeling filter-related and source-related representations.
Our novel prosody modeling technique significantly contributes to MultiVerse's ability to generate speech with high prosody similarity to the given prompts.
arXiv Detail & Related papers (2024-10-04T07:10:25Z) - SpoofCeleb: Speech Deepfake Detection and SASV In The Wild [76.71096751337888]
SpoofCeleb is a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV).
We utilize source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data.
SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions.
arXiv Detail & Related papers (2024-09-18T23:17:02Z) - kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech [18.701864254184308]
kNN-TTS is a simple and effective framework for zero-shot multi-speaker text-to-speech.
Our models, trained on transcribed speech from a single speaker, achieve performance comparable to state-of-the-art models.
We also introduce a parameter that enables fine-grained voice morphing, as illustrated in the sketch below.
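As a rough illustration of the retrieval-plus-interpolation idea described above (not the paper's exact formulation; the array shapes, k-neighbour averaging, and `lmbda` weight are assumptions), each source frame's feature is swapped for the average of its nearest neighbours among the target speaker's features, and a linear weight morphs between the two voices:

```python
import numpy as np

def knn_morph(source_feats: np.ndarray,
              target_feats: np.ndarray,
              k: int = 4,
              lmbda: float = 1.0) -> np.ndarray:
    """Replace each source frame with the mean of its k nearest target
    frames, then interpolate between source and retrieved features.

    Shapes: source_feats (T, D), target_feats (N, D). Illustrative only.
    lmbda = 1.0 gives full conversion to the target voice; intermediate
    values morph between the two voices.
    """
    # Pairwise Euclidean distances between source and target frames: (T, N).
    dists = np.linalg.norm(
        source_feats[:, None, :] - target_feats[None, :, :], axis=-1)
    # Indices of the k closest target frames for every source frame: (T, k).
    nn_idx = np.argsort(dists, axis=1)[:, :k]
    retrieved = target_feats[nn_idx].mean(axis=1)  # (T, D)
    # Linear interpolation controls the degree of voice morphing.
    return lmbda * retrieved + (1.0 - lmbda) * source_feats
```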
arXiv Detail & Related papers (2024-08-20T12:09:58Z) - Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS).
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z) - Transfer Learning Framework for Low-Resource Text-to-Speech using a
Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale, text-labeled speech corpus.
We propose a transfer learning framework for TTS that uses a large unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z) - YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice
Conversion for everyone [0.7927630381442314]
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS.
We achieve state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset.
The YourTTS model can be fine-tuned with less than 1 minute of speech to achieve state-of-the-art voice similarity with reasonable quality.
arXiv Detail & Related papers (2021-12-04T19:50:29Z) - A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - Learning Speaker Embedding from Text-to-Speech [59.80309164404974]
We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion.
We investigated training TTS from either manual or ASR-generated transcripts.
Unsupervised TTS embeddings improved the equal error rate (EER) by 2.06% absolute relative to i-vectors on the LibriTTS dataset.
arXiv Detail & Related papers (2020-10-21T18:03:16Z) - Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis
Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)