Towards High-Quality Neural TTS for Low-Resource Languages by Learning
Compact Speech Representations
- URL: http://arxiv.org/abs/2210.15131v1
- Date: Thu, 27 Oct 2022 02:32:00 GMT
- Title: Towards High-Quality Neural TTS for Low-Resource Languages by Learning
Compact Speech Representations
- Authors: Haohan Guo, Fenglong Xie, Xixin Wu, Hui Lu, Helen Meng
- Abstract summary: This paper aims to enhance low-resource TTS by reducing training data requirements using compact speech representations.
A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to waveforms.
We optimize the training strategy by leveraging more audio to learn MSMCRs better for low-resource languages.
- Score: 43.31594896204752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper aims to enhance low-resource TTS by reducing training data
requirements using compact speech representations. A Multi-Stage Multi-Codebook
(MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to
waveforms. Subsequently, we train the multi-stage predictor to predict MSMCRs
from the text for TTS synthesis. Moreover, we optimize the training strategy by
leveraging more audio to learn MSMCRs better for low-resource languages. It
selects audio from other languages using speaker similarity metric to augment
the training set, and applies transfer learning to improve training quality. In
MOS tests, the proposed system significantly outperforms FastSpeech and VITS in
standard and low-resource scenarios, showing lower data requirements. The
proposed training strategy effectively enhances MSMCRs on waveform
reconstruction. It improves TTS performance further, which wins 77% votes in
the preference test for the low-resource TTS with only 15 minutes of paired
data.
Related papers
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z) - Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach [0.6445605125467574]
This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks.
The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments.
We propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training.
arXiv Detail & Related papers (2024-06-03T15:38:40Z) - QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via
Vector-Quantized Self-Supervised Speech Representation Learning [65.35080911787882]
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower supervised data requirements.
Two VQ-S3R learners provide profitable speech representations and pre-trained models for TTS.
The results powerfully demonstrate the superior performance of QS-TTS, winning the highest MOS over supervised or semi-supervised baseline TTS approaches.
arXiv Detail & Related papers (2023-08-31T20:25:44Z) - Strategies for improving low resource speech to text translation relying
on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST)
We conducted experiments on both simulated and real-low resource setups, on language pairs English - Portuguese, and Tamasheq - French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z) - A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural
TTS [52.51848317549301]
We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis.
A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data.
In synthesis, the neural vocoder converts the predicted MSMCRs into final speech waveforms.
arXiv Detail & Related papers (2022-09-22T09:43:17Z) - When Is TTS Augmentation Through a Pivot Language Useful? [26.084140117526488]
We propose to produce synthetic audio by running text from the target language through a trained TTS system for a higher-resource pivot language.
Using several thousand synthetic TTS text-speech pairs and duplicating authentic data to balance yields optimal results.
Application of these findings improves ASR by 64.5% and 45.0% character error reduction rate (CERR) respectively for two low-resource languages.
arXiv Detail & Related papers (2022-07-20T13:33:41Z) - LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.