An Automated End-to-End Open-Source Software for High-Quality
Text-to-Speech Dataset Generation
- URL: http://arxiv.org/abs/2402.16380v1
- Date: Mon, 26 Feb 2024 07:58:33 GMT
- Title: An Automated End-to-End Open-Source Software for High-Quality
Text-to-Speech Dataset Generation
- Authors: Ahmet Gunduz, Kamer Ali Yuksel, Kareem Darwish, Golara Javadi, Fabio
Minazzi, Nicola Sobieski and Sebastien Bratieres
- Abstract summary: This paper introduces an end-to-end tool to generate high-quality datasets for text-to-speech (TTS) models.
The contributions of this work are manifold and include: the integration of language-specific phoneme distribution into sample selection.
The proposed application aims to streamline the dataset creation process for TTS models through these features.
- Score: 3.6893151241749966
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Data availability is crucial for advancing artificial intelligence
applications, including voice-based technologies. As content creation,
particularly in social media, experiences increasing demand, translation and
text-to-speech (TTS) technologies have become essential tools. Notably, the
performance of these TTS technologies is highly dependent on the quality of the
training data, emphasizing the mutual dependence of data availability and
technological progress. This paper introduces an end-to-end tool to generate
high-quality datasets for text-to-speech (TTS) models to address this critical
need for high-quality data. The contributions of this work are manifold and
include: the integration of language-specific phoneme distribution into sample
selection, automation of the recording process, automated and human-in-the-loop
quality assurance of recordings, and processing of recordings to meet specified
formats. The proposed application aims to streamline the dataset creation
process for TTS models through these features, thereby facilitating
advancements in voice-based technologies.
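The abstract's first contribution, selecting samples by language-specific phoneme distribution, can be illustrated with a small sketch. This is not the authors' implementation; it is a minimal, hypothetical greedy selector that picks sentences (given as phoneme sequences) whose pooled phoneme distribution best approaches a target distribution for the language, measured by KL divergence.

```python
from collections import Counter
import math

def phoneme_distribution(samples):
    """Normalized phoneme frequencies over a list of phoneme sequences."""
    counts = Counter(p for seq in samples for p in seq)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def kl_divergence(target, candidate, eps=1e-9):
    """KL(target || candidate); eps smooths phonemes absent from candidate."""
    return sum(
        t * math.log(t / (candidate.get(p, 0.0) + eps))
        for p, t in target.items() if t > 0
    )

def select_samples(candidates, target_dist, k):
    """Greedily pick k sequences whose combined phoneme distribution
    minimizes KL divergence to the language-specific target."""
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        best = min(
            remaining,
            key=lambda s: kl_divergence(
                target_dist, phoneme_distribution(selected + [s])
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: a balanced two-phoneme target prefers the balanced sentence.
target = {"a": 0.5, "b": 0.5}
candidates = [["a", "a", "a"], ["a", "b"], ["b", "b"]]
print(select_samples(candidates, target, 1))  # → [['a', 'b']]
```

The greedy loop is quadratic in the candidate count, which is acceptable for dataset-scale script selection; the paper itself does not specify the selection algorithm, so treat this purely as one plausible realization.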
Related papers
- Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement [54.51467153859695]
This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task of personalized speech enhancement (PSE).
We aim to investigate how the quality of augmented data generated by zero-shot TTS models affects PSE model performance.
arXiv Detail & Related papers (2025-01-23T04:27:37Z)
- Unsupervised Data Validation Methods for Efficient Model Training [0.0]
State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT) and vision-language models (VLM) rely heavily on large datasets.
This research explores key areas such as defining "quality data," developing methods for generating appropriate data and enhancing accessibility to model training.
arXiv Detail & Related papers (2024-10-10T13:00:53Z)
- Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS [0.0]
This research introduces a comprehensive Bahasa text-to-speech dataset and a novel TTS model, EnGen-TTS.
The proposed EnGen-TTS model outperforms established baselines, achieving a Mean Opinion Score (MOS) of 4.45 ± 0.13.
This research marks a significant advancement in Bahasa TTS technology, with implications for diverse language applications.
arXiv Detail & Related papers (2024-10-09T07:01:05Z)
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It aims to provide a comprehensive evaluation framework for detecting cutting-edge AI-synthesized audio.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition [48.527630771422935]
We propose a synthetic data generation pipeline for multi-speaker conversational ASR.
We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings.
arXiv Detail & Related papers (2024-08-17T14:47:05Z)
- On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition [31.58289343561422]
We compare five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training.
For data generation, auto-regressive decoding performs better than non-autoregressive decoding; we also propose an approach to quantify TTS generalization capabilities.
arXiv Detail & Related papers (2024-07-31T09:37:27Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset produced by Indians (in English) was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- A Survey on Neural Speech Synthesis [110.39292386792555]
Text to speech (TTS) is a hot research topic in speech, language, and machine learning communities.
We conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends.
We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc.
arXiv Detail & Related papers (2021-06-29T16:50:51Z)
- HUI-Audio-Corpus-German: A high quality TTS dataset [0.0]
"HUI-Audio-Corpus-German" is a large, open-source dataset for TTS engines, created with a processing pipeline.
This dataset provides high-quality audio-to-transcription alignments and reduces the manual effort needed for dataset creation.
arXiv Detail & Related papers (2021-06-11T10:59:09Z)