Data Processing for Optimizing Naturalness of Vietnamese Text-to-speech System
- URL: http://arxiv.org/abs/2004.09607v1
- Date: Mon, 20 Apr 2020 20:11:53 GMT
- Title: Data Processing for Optimizing Naturalness of Vietnamese Text-to-speech System
- Authors: Viet Lam Phung, Phan Huy Kinh, Anh Tuan Dinh, Quoc Bao Nguyen
- Abstract summary: We aim to optimize the naturalness of a TTS system trained on found data using a novel data processing method.
We show that an end-to-end TTS system achieved a mean opinion score (MOS) of 4.1, compared to 4.3 for natural speech.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end text-to-speech (TTS) systems have proved highly successful given a large amount of high-quality training data recorded in an anechoic room with a high-quality microphone. An alternative approach is to use an available source of found data, such as radio broadcast news. We aim to optimize the naturalness of a TTS system built on found data using a novel data processing method. The method consists of 1) utterance selection and 2) prosodic punctuation insertion, which prepare training data that optimizes the naturalness of TTS systems. Using this data processing method, an end-to-end TTS system achieved a mean opinion score (MOS) of 4.1, compared to 4.3 for natural speech. Punctuation insertion contributed the most to this result. To facilitate the research and development of TTS systems, we distribute the processed data of one speaker at https://forms.gle/6Hk5YkqgDxAaC2BU6.
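The abstract names the two processing steps but gives no implementation details. Below is a minimal, hypothetical sketch of what such a pipeline could look like; the function names, the SNR-based selection criterion, and the pause-threshold punctuation heuristic are all illustrative assumptions, not the authors' actual method.

```python
# Hypothetical sketch of the two-step data processing pipeline:
# (1) utterance selection, (2) prosodic punctuation insertion.
# Thresholds and heuristics are illustrative assumptions only.
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    text: str                # transcript of the found-data clip
    snr_db: float            # estimated signal-to-noise ratio of the audio
    pauses_sec: List[float]  # silence duration after each word

def select_utterances(utterances, min_snr_db=20.0):
    """Step 1: keep only clips clean enough for TTS training
    (assumed criterion: estimated SNR above a threshold)."""
    return [u for u in utterances if u.snr_db >= min_snr_db]

def insert_prosodic_punctuation(utt, pause_threshold_sec=0.3):
    """Step 2: mark long pauses with commas so the model can learn
    pause-aligned prosody (assumed heuristic)."""
    words = utt.text.split()
    marked = [w + ("," if p >= pause_threshold_sec else "")
              for w, p in zip(words, utt.pauses_sec)]
    return " ".join(marked)

# Example: a 0.4 s pause after the second word yields an inserted comma.
clip = Utterance("xin chao cac ban", snr_db=25.0,
                 pauses_sec=[0.05, 0.4, 0.05, 0.0])
for u in select_utterances([clip]):
    print(insert_prosodic_punctuation(u))  # -> "xin chao, cac ban"
```

The paper reports that punctuation insertion contributed most to the MOS result, which is consistent with the idea of marking pause locations explicitly in the training text.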
Related papers
- SpoofCeleb: Speech Deepfake Detection and SASV In The Wild [76.71096751337888]
SpoofCeleb is a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV)
We utilize source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data.
SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions.
arXiv Detail & Related papers (2024-09-18T23:17:02Z)
- Text-To-Speech Synthesis In The Wild [76.71096751337888]
Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms.
We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, applied to the VoxCeleb1 dataset commonly used for speaker recognition.
We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard.
arXiv Detail & Related papers (2024-09-13T10:58:55Z)
- Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition [42.09340937787435]
We investigated the representation ability of different speech self-supervised pre-trained models.
We employed a powerful large language model (LLM), GPT-4, and emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech.
arXiv Detail & Related papers (2023-09-19T03:52:01Z)
- Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study [44.07589545984369]
We propose a fully unsupervised method for building TTS, including automatic data selection and pre-training/fine-tuning strategies.
We show how careful data selection, even in smaller amounts, can improve the efficiency of a TTS system.
Our objective evaluation shows a 3.9% character error rate (CER), while the ground truth has a 1.3% CER.
arXiv Detail & Related papers (2023-01-22T10:41:58Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)