KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
- URL: http://arxiv.org/abs/2104.08459v1
- Date: Sat, 17 Apr 2021 05:49:57 GMT
- Title: KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
- Authors: Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov, Yerbolat
Khassanov, Huseyin Atakan Varol
- Abstract summary: This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide.
The dataset consists of about 91 hours of transcribed audio recordings spoken by two professional speakers.
It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech applications in both academia and industry.
- Score: 4.542831770689362
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a high-quality open-source speech synthesis dataset for
Kazakh, a low-resource language spoken by over 13 million people worldwide. The
dataset consists of about 91 hours of transcribed audio recordings spoken by
two professional speakers (female and male). It is the first publicly available
large-scale dataset developed to promote Kazakh text-to-speech (TTS)
applications in both academia and industry. In this paper, we share our
experience by describing the dataset development procedures and the
challenges faced, and discuss important future directions. To demonstrate the
reliability of our dataset, we built baseline end-to-end TTS models and
evaluated them using the subjective mean opinion score (MOS) measure.
Evaluation results show that the best TTS models trained on our dataset achieve
MOS above 4 for both speakers, which makes them applicable for practical use.
The dataset, training recipe, and pretrained TTS models are freely available.
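The mean opinion score (MOS) used in the evaluation above is the average of subjective listener ratings, typically on a 1-5 scale, reported with a confidence interval. A minimal sketch of that aggregation, assuming a 1-5 scale and a normal approximation for the 95% interval (the paper's exact rating protocol and listener counts are not given here, so the helper and the scores below are hypothetical):

```python
import math
import statistics

def mos(ratings):
    """Mean opinion score with a 95% confidence interval.

    `ratings` is a list of 1-5 listener scores for one TTS system
    (hypothetical helper; uses the normal approximation 1.96 * SE).
    """
    m = statistics.mean(ratings)
    se = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return m, 1.96 * se

# Hypothetical scores from 10 listeners for one synthesized utterance set
scores = [5, 4, 4, 5, 4, 5, 3, 4, 5, 4]
mean, ci = mos(scores)
print(f"MOS = {mean:.2f} +/- {ci:.2f}")
```

A MOS above 4, as reported for both speakers in this dataset, generally corresponds to "good" perceived quality on the standard 5-point scale.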
Related papers
- Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS [0.0]
This research introduces a comprehensive Bahasa text-to-speech dataset and a novel TTS model, EnGen-TTS.
The proposed EnGen-TTS model outperforms established baselines, achieving a Mean Opinion Score (MOS) of 4.45 ± 0.13.
This research marks a significant advancement in Bahasa TTS technology, with implications for diverse language applications.
arXiv Detail & Related papers (2024-10-09T07:01:05Z)
- SpoofCeleb: Speech Deepfake Detection and SASV In The Wild [76.71096751337888]
SpoofCeleb is a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV)
We utilize source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data.
SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions.
arXiv Detail & Related papers (2024-09-18T23:17:02Z)
- Text-To-Speech Synthesis In The Wild [76.71096751337888]
Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms.
We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, applied to the VoxCeleb1 dataset commonly used for speaker recognition.
We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard.
arXiv Detail & Related papers (2024-09-13T10:58:55Z)
- IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS [0.9092013845117769]
IndicVoices-R (IV-R) is the largest multilingual Indian TTS dataset derived from an ASR dataset.
IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS.
We release the first TTS model for all 22 official Indian languages.
arXiv Detail & Related papers (2024-09-09T06:28:47Z)
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, Arabic still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- ArmanTTS single-speaker Persian dataset [2.7836084563851284]
This paper introduces ArmanTTS, a single-speaker Persian dataset.
We show that ArmanTTS meets the necessary standards for training a Persian text-to-speech model.
We also combined Tacotron 2 and HiFi-GAN to design a model that receives phonemes as input and outputs the corresponding speech.
arXiv Detail & Related papers (2023-04-07T10:52:55Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST), which translates speech from one language into another.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline [16.95694149810552]
This paper introduces a high-quality open-source text-to-speech dataset for Mongolian, a low-resource language spoken by over 10 million people worldwide.
The dataset, named MnTTS, consists of about 8 hours of transcribed audio recordings spoken by a 22-year-old professional female Mongolian announcer.
It is the first publicly available dataset developed to promote Mongolian TTS applications in both academia and industry.
arXiv Detail & Related papers (2022-09-22T08:24:43Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.