Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a
Case Study
- URL: http://arxiv.org/abs/2301.09099v2
- Date: Thu, 26 Jan 2023 07:51:54 GMT
- Title: Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a
Case Study
- Authors: Massa Baali, Tomoki Hayashi, Hamdy Mubarak, Soumi Maiti, Shinji
Watanabe, Wassim El-Hajj, Ahmed Ali
- Abstract summary: We propose a fully unsupervised method for building TTS, including automatic data selection and pre-training/fine-tuning strategies.
We show how careful selection of a smaller amount of data can improve the efficiency of the TTS system.
Our objective evaluation shows a 3.9% character error rate (CER), while the ground truth has a 1.3% CER.
- Score: 44.07589545984369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several high-resource Text-to-Speech (TTS) systems currently produce natural,
human-like speech. In contrast, low-resource languages, including Arabic, have
very limited TTS systems due to the lack of resources. We propose a fully
unsupervised method for building TTS, including automatic data selection and
pre-training/fine-tuning strategies for TTS training, using broadcast news as a
case study. We show how careful selection of a smaller amount of data can make
a TTS system generate more natural speech than a system trained on a bigger
dataset. We propose different approaches for: 1) the data, where we applied
automatic annotation using DNSMOS, automatic vowelization, and automatic speech
recognition (ASR) to fix transcription errors; and 2) the model, where we used
transfer learning from a high-resource language for the TTS model, fine-tuned
it with one hour of broadcast recordings, and then used this model to guide the
durations of a FastSpeech2-based Conformer model. Our objective evaluation
shows a 3.9% character error rate (CER), while the ground truth has a 1.3% CER.
In the subjective evaluation, where 1 is bad and 5 is excellent, our
FastSpeech2-based Conformer model achieved a mean opinion score (MOS) of 4.4
for intelligibility and 4.2 for naturalness; many annotators recognized the
voice of the broadcaster, which demonstrates the effectiveness of our proposed
unsupervised method.
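The two-stage recipe in the abstract, automatic data selection followed by teacher-guided training, can be illustrated briefly. Below is a minimal Python sketch of the selection step, assuming a DNSMOS-style quality predictor and an ASR model are available as callables; `estimate_mos` and `asr_transcribe` are hypothetical stand-ins, and the thresholds are illustrative rather than the paper's values. Clips are kept only if their predicted audio quality is high and their transcript agrees with the ASR hypothesis under a character error rate (CER) threshold.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Clip:
    audio_path: str
    transcript: str  # possibly noisy broadcast-news transcript

def char_error_rate(ref: str, hyp: str) -> float:
    """Character-level Levenshtein distance, normalized by reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)

def select_clips(clips: Iterable[Clip],
                 estimate_mos: Callable[[str], float],
                 asr_transcribe: Callable[[str], str],
                 mos_threshold: float = 4.0,
                 cer_threshold: float = 0.05) -> List[Clip]:
    """Keep clips that pass both the quality check and the transcript check.
    estimate_mos and asr_transcribe are hypothetical stand-ins for a real
    DNSMOS predictor and ASR model; thresholds are illustrative only."""
    selected = []
    for clip in clips:
        if estimate_mos(clip.audio_path) < mos_threshold:
            continue  # drop noisy or low-quality audio
        hyp = asr_transcribe(clip.audio_path)
        if char_error_rate(clip.transcript, hyp) <= cer_threshold:
            selected.append(clip)
    return selected
```

The duration-guidance step can be pictured the same way: an autoregressive teacher (here, the model fine-tuned on one hour of broadcast recordings) provides an attention alignment, from which per-token durations are extracted to supervise the FastSpeech2-based Conformer's duration predictor. The sketch below follows the standard FastSpeech-style recipe of assigning each decoder frame to its most-attended input token; this is an assumption about the mechanics, not the authors' exact procedure.

```python
import numpy as np

def durations_from_attention(attn: np.ndarray) -> np.ndarray:
    """attn: (n_frames, n_tokens) soft alignment from a teacher model.
    Assign each decoder frame to its most-attended input token and count
    frames per token; the counts serve as duration targets."""
    frame_to_token = attn.argmax(axis=1)  # token index per frame
    return np.bincount(frame_to_token, minlength=attn.shape[1])
```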
Related papers
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning [6.544954579068865]
We propose a transfer learning approach using high-resource language data and synthetically generated data.
We employ a three-step approach to train a high-quality single-speaker TTS system in Hindi, a low-resource Indian language.
arXiv Detail & Related papers (2023-12-02T10:52:00Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech data for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
- Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages [19.44975351652865]
We propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages.
We leverage a well-trained English model, unlabeled text corpus, and unlabeled audio corpus using transfer learning, TTS augmentation, and SSL respectively.
Overall, our two-pass speech recognition system with a Monotonic Chunkwise Attention (MoA) in the first pass achieves a WER reduction of 42% relative to the baseline.
arXiv Detail & Related papers (2021-11-19T05:09:16Z)
- Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech [5.521191428642322]
We present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker.
Compared to the current state-of-the-art approach, our proposed improvements close the gap to recordings by 23.3% for naturalness of speech.
arXiv Detail & Related papers (2021-06-24T10:52:10Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system that supports rare languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- Data Processing for Optimizing Naturalness of Vietnamese Text-to-speech System [0.7160601421935839]
We aim to optimize the naturalness of a TTS system built on found data using a novel data processing method.
We show that an end-to-end TTS system achieved a mean opinion score (MOS) of 4.1, compared to 4.3 for natural speech.
arXiv Detail & Related papers (2020-04-20T20:11:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.