KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
- URL: http://arxiv.org/abs/2104.08459v1
- Date: Sat, 17 Apr 2021 05:49:57 GMT
- Title: KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
- Authors: Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov, Yerbolat
Khassanov, Huseyin Atakan Varol
- Abstract summary: This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide.
The dataset consists of about 91 hours of transcribed audio recordings spoken by two professional speakers.
It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech applications in both academia and industry.
- Score: 4.542831770689362
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a high-quality open-source speech synthesis dataset for
Kazakh, a low-resource language spoken by over 13 million people worldwide. The
dataset consists of about 91 hours of transcribed audio recordings spoken by
two professional speakers (female and male). It is the first publicly available
large-scale dataset developed to promote Kazakh text-to-speech (TTS)
applications in both academia and industry. In this paper, we share our
experience by describing the dataset development procedures and the challenges
we faced, and discuss important future directions. To demonstrate the
reliability of our dataset, we built baseline end-to-end TTS models and
evaluated them using the subjective mean opinion score (MOS) measure.
Evaluation results show that the best TTS models trained on our dataset achieve
MOS above 4 for both speakers, which makes them applicable for practical use.
The dataset, training recipe, and pretrained TTS models are freely available.
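The subjective MOS evaluation mentioned above aggregates 1-to-5 listener ratings into a per-speaker mean. A minimal sketch of that aggregation, with a normal-approximation 95% confidence interval; the rating values and speaker labels below are invented for illustration and are not from the paper:

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Aggregate 1-5 listener ratings into a MOS with a 95% confidence interval."""
    n = len(ratings)
    mos = statistics.mean(ratings)
    # Normal-approximation interval; reasonable when n is reasonably large
    ci = 1.96 * statistics.stdev(ratings) / math.sqrt(n) if n > 1 else 0.0
    return mos, ci

# Hypothetical ratings for the two speakers (illustrative values only)
female_ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5]
male_ratings = [4, 5, 4, 4, 5, 4, 5, 3, 4, 4]

for name, ratings in [("female", female_ratings), ("male", male_ratings)]:
    mos, ci = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {ci:.2f}")
```

A MOS above 4 on this 1-5 scale is conventionally read as "good" quality, which is the threshold the abstract uses to argue the models are practically usable.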
Related papers
- Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments [8.103855990028842]
We introduce Synth4Kws - a framework to leverage Text to Speech (TTS) synthesized data for custom KWS.
We found increasing TTS phrase diversity and utterance sampling monotonically improves model performance.
Our experiments are based on English and single-word utterances, but the findings generalize to i18n languages.
arXiv Detail & Related papers (2024-07-23T21:05:44Z)
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, Arabic still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model.
We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z)
- ArmanTTS single-speaker Persian dataset [2.7836084563851284]
This paper introduces the single speaker dataset: ArmanTTS.
We show that ArmanTTS meets the standards necessary for training a Persian text-to-speech model.
We also combined Tacotron 2 and HiFi-GAN to design a model that receives phonemes as input and outputs the corresponding speech.
arXiv Detail & Related papers (2023-04-07T10:52:55Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z)
- MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline [16.95694149810552]
This paper introduces a high-quality open-source text-to-speech dataset for Mongolian, a low-resource language spoken by over 10 million people worldwide.
The dataset, named MnTTS, consists of about 8 hours of transcribed audio recordings spoken by a 22-year-old professional female Mongolian announcer.
It is the first publicly available dataset developed to promote Mongolian TTS applications in both academia and industry.
arXiv Detail & Related papers (2022-09-22T08:24:43Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition systems for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public-domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.