QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via
Vector-Quantized Self-Supervised Speech Representation Learning
- URL: http://arxiv.org/abs/2309.00126v1
- Date: Thu, 31 Aug 2023 20:25:44 GMT
- Title: QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via
Vector-Quantized Self-Supervised Speech Representation Learning
- Authors: Haohan Guo, Fenglong Xie, Jiawen Kang, Yujia Xiao, Xixin Wu, Helen
Meng
- Abstract summary: This paper proposes QS-TTS, a novel semi-supervised TTS framework that improves TTS quality while requiring less supervised data.
Two VQ-S3R learners provide effective speech representations and pre-trained models for TTS.
The results demonstrate the superior performance of QS-TTS, which achieves the highest MOS among supervised and semi-supervised baseline TTS approaches.
- Score: 65.35080911787882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes QS-TTS, a novel semi-supervised TTS framework
that improves TTS quality with lower supervised-data requirements via
Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL) on
additional unlabeled speech audio. The framework comprises two VQ-S3R
learners: the principal learner produces a generative Multi-Stage
Multi-Codebook (MSMC) VQ-S3R via an MSMC-VQ-GAN combined with contrastive
S3RL, while decoding it back to high-quality audio; the associate learner then
further abstracts the MSMC representation into a highly compact VQ
representation through a VQ-VAE. These two generative VQ-S3R learners provide
effective speech representations and pre-trained models for TTS, significantly
improving synthesis quality while requiring less supervised data. QS-TTS is
evaluated comprehensively under various scenarios via subjective and objective
tests. The results demonstrate the superior performance of QS-TTS, which
achieves the highest MOS among supervised and semi-supervised baseline TTS
approaches, especially in low-resource scenarios. Comparing various speech
representations and transfer-learning methods in TTS further validates the
benefit of the proposed VQ-S3RL, which yields the best audio-quality and
intelligibility metrics. Finally, the synthesis quality of QS-TTS decays more
slowly as supervised data decreases, underscoring its lower supervised-data
requirements and its potential in low-resource scenarios.
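To make the two-learner design concrete, below is a minimal sketch of multi-stage multi-codebook vector quantization, with a single small codebook standing in for the associate VQ-VAE's compression step. All module names, dimensions, and codebook sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiCodebookVQ(nn.Module):
    """Split each feature vector into `heads` slices and quantize every
    slice with its own codebook (product quantization)."""
    def __init__(self, dim: int, heads: int, codebook_size: int):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.slice = heads, dim // heads
        self.codebooks = nn.Parameter(torch.randn(heads, codebook_size, self.slice))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        xs = x.view(b, t, self.heads, self.slice).transpose(1, 2)  # (b, h, t, s)
        idx = torch.cdist(xs, self.codebooks).argmin(-1)           # (b, h, t)
        q = torch.stack([self.codebooks[h][idx[:, h]]
                         for h in range(self.heads)], dim=2)       # (b, t, h, s)
        q = q.reshape(b, t, -1)
        return x + (q - x).detach()  # straight-through gradient estimator

class MSMCQuantizer(nn.Module):
    """Apply several multi-codebook VQ stages to successive residuals."""
    def __init__(self, dim=256, stages=2, heads=4, codebook_size=64):
        super().__init__()
        self.stages = nn.ModuleList(
            MultiCodebookVQ(dim, heads, codebook_size) for _ in range(stages))

    def forward(self, x):
        out, residual = torch.zeros_like(x), x
        for vq in self.stages:
            q = vq(residual)
            out, residual = out + q, residual - q.detach()
        return out

# Usage: quantize frame-level speech features with the "principal" MSMC
# quantizer, then compress again with a single small "associate" codebook.
feats = torch.randn(1, 100, 256)                 # hypothetical S3R features
msmcr = MSMCQuantizer()(feats)                   # MSMC VQ-S3R representation
compact = MultiCodebookVQ(256, 1, 32)(msmcr)     # highly compact VQ code
```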
Related papers
- An Experimental Study: Assessing the Combined Framework of WavLM and
BEST-RQ for Text-to-Speech Synthesis [0.5076419064097734]
We propose a new model architecture specifically suited for text-to-speech (TTS).
We combine WavLM, a pre-trained self-supervised learning (SSL) speech model, with the BEST-RQ vector quantization framework.
Experiments on the LibriSpeech dataset with SUPERB benchmarking show that the proposed model significantly underperforms.
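For context, here is a minimal sketch of the core BEST-RQ idea referenced above: a frozen random projection and a frozen random codebook turn each input frame into a discrete SSL training target. The dimensions and codebook size are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim_in, dim_code, vocab = 80, 16, 8192        # mel bins, projection dim, codes
projection = torch.randn(dim_in, dim_code)    # frozen: never trained
codebook = F.normalize(torch.randn(vocab, dim_code), dim=-1)  # frozen

def bestrq_labels(mel: torch.Tensor) -> torch.Tensor:
    """mel: (batch, time, dim_in) -> integer SSL targets (batch, time)."""
    z = F.normalize(mel @ projection, dim=-1)
    return torch.cdist(z, codebook).argmin(-1)  # nearest codeword index

labels = bestrq_labels(torch.randn(2, 50, dim_in))  # targets for masked frames
```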
arXiv Detail & Related papers (2023-12-08T23:59:25Z)
- Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video
Quality Assessment [54.31355080688127]
We introduce a text-prompted Semantic Affinity Quality Index (SAQI) and its localized version (SAQI-Local) using Contrastive Language-Image Pre-training (CLIP).
The resulting BVQI-Local demonstrates unprecedented performance, surpassing existing zero-shot indices by at least 24% on all datasets.
We conduct comprehensive analyses to investigate different quality concerns of distinct indices, demonstrating the effectiveness and rationality of our design.
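As a rough illustration of the text-prompted idea, the sketch below scores a frame by its CLIP affinity to an antonym prompt pair; the model name and prompts are assumptions for illustration, not the paper's exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def semantic_affinity(frame: Image.Image) -> float:
    """Probability mass CLIP assigns to the positive quality prompt."""
    inputs = processor(text=["a high quality photo", "a low quality photo"],
                       images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()
```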
arXiv Detail & Related papers (2023-04-28T08:06:05Z)
- Towards High-Quality Neural TTS for Low-Resource Languages by Learning
Compact Speech Representations [43.31594896204752]
This paper aims to enhance low-resource TTS by reducing training data requirements using compact speech representations.
A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is trained to learn the representation (MSMCR) and decode it to waveforms.
We optimize the training strategy by leveraging more audio to learn MSMCRs better for low-resource languages.
arXiv Detail & Related papers (2022-10-27T02:32:00Z)
- A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural
TTS [52.51848317549301]
We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis.
A vector-quantized variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of the speech training data.
In synthesis, the neural vocoder converts the predicted MSMCRs into final speech waveforms.
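The feature analyzer above builds on the standard VQ-VAE recipe; a minimal sketch of that objective follows, combining reconstruction, codebook, and commitment terms with a straight-through estimator. The encoder, decoder, and sizes are illustrative placeholders, not the paper's model.

```python
import torch
import torch.nn.functional as F

def vqvae_loss(encoder, decoder, codebook, mel, beta=0.25):
    """mel: (batch, time, n_mels); codebook: (K, latent_dim)."""
    z = encoder(mel)                                   # continuous frame latents
    q = codebook[torch.cdist(z, codebook).argmin(-1)]  # nearest codewords
    mel_hat = decoder(z + (q - z).detach())            # straight-through estimator
    return (F.l1_loss(mel_hat, mel)                    # reconstruction
            + F.mse_loss(q, z.detach())                # codebook: pull codes to z
            + beta * F.mse_loss(z, q.detach()))        # commitment: pull z to codes

# Toy usage with linear stand-ins for the real encoder/decoder.
enc, dec = torch.nn.Linear(80, 64), torch.nn.Linear(64, 80)
cb = torch.randn(512, 64, requires_grad=True)
loss = vqvae_loss(enc, dec, cb, torch.randn(4, 100, 80))
```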
arXiv Detail & Related papers (2022-09-22T09:43:17Z)
- Voice Filter: Few-shot text-to-speech speaker adaptation using voice
conversion as a post-processing module [16.369219400819134]
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech.
When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations.
We propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker.
arXiv Detail & Related papers (2022-02-16T16:12:21Z)
- A Survey on Neural Speech Synthesis [110.39292386792555]
Text-to-speech (TTS) is a hot research topic in the speech, language, and machine learning communities.
We conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends.
We focus on the key components of neural TTS, including text analysis, acoustic models, and vocoders, as well as several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS.
arXiv Detail & Related papers (2021-06-29T16:50:51Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system that supports rare languages at low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS for more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
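A minimal sketch of that recipe, with every function as a hypothetical placeholder rather than the paper's code: synthesize speech for extra text with a TTS model trained on the ASR corpus, then mix real and synthetic pairs for recognizer training.

```python
def augment_asr_corpus(real_pairs, extra_texts, tts, mix_ratio=1.0):
    """real_pairs: list of (audio, text); tts: text -> waveform (hypothetical)."""
    synthetic = [(tts(text), text) for text in extra_texts]
    budget = int(mix_ratio * len(real_pairs))   # cap the synthetic share
    return real_pairs + synthetic[:budget]

# train_asr(augment_asr_corpus(pairs, unpaired_texts, tts_model))  # hypothetical
```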
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- Comparison of Speech Representations for Automatic Quality Estimation in
Multi-Speaker Text-to-Speech Synthesis [21.904558308567122]
We aim to characterize how different speakers contribute to the perceived output quality of multi-speaker Text-to-Speech synthesis.
We automatically rate the quality of TTS using a neural network (NN) trained on human mean opinion score (MOS) ratings.
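A minimal sketch of such a MOS predictor, assuming pooled frame-level features regressed to the 1-5 MOS scale; the architecture is an illustrative guess, not the paper's network.

```python
import torch
import torch.nn as nn

class MOSPredictor(nn.Module):
    def __init__(self, feat_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, feats):                       # (batch, time, feat_dim)
        frame_scores = self.net(feats).squeeze(-1)  # (batch, time)
        pooled = frame_scores.mean(dim=1)           # average over time
        return 1.0 + 4.0 * torch.sigmoid(pooled)    # map into the 1-5 MOS range

scores = MOSPredictor()(torch.randn(2, 120, 80))    # train with MSE vs human MOS
```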
arXiv Detail & Related papers (2020-02-28T10:44:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.