MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and
Accompanied Baseline
- URL: http://arxiv.org/abs/2209.10848v1
- Date: Thu, 22 Sep 2022 08:24:43 GMT
- Title: MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and
Accompanied Baseline
- Authors: Yifan Hu, Pengkai Yin, Rui Liu, Feilong Bao and Guanglai Gao
- Abstract summary: This paper introduces a high-quality open-source text-to-speech dataset for Mongolian, a low-resource language spoken by over 10 million people worldwide.
The dataset, named MnTTS, consists of about 8 hours of transcribed audio recordings spoken by a 22-year-old professional female Mongolian announcer.
It is the first publicly available dataset developed to promote Mongolian TTS applications in both academia and industry.
- Score: 16.95694149810552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a high-quality open-source text-to-speech (TTS)
synthesis dataset for Mongolian, a low-resource language spoken by over 10
million people worldwide. The dataset, named MnTTS, consists of about 8 hours
of transcribed audio recordings spoken by a 22-year-old professional female
Mongolian announcer. It is the first publicly available dataset developed to
promote Mongolian TTS applications in both academia and industry. In this
paper, we share our experience by describing the dataset development procedures
and the challenges we faced. To demonstrate the reliability of our dataset, we built a
powerful non-autoregressive baseline system based on the FastSpeech2 model and
HiFi-GAN vocoder, and evaluated it using the subjective mean opinion score
(MOS) and real-time factor (RTF) metrics. Evaluation results show that the
powerful baseline system trained on our dataset achieves MOS above 4 and RTF
about $3.30\times10^{-1}$, which makes it applicable for practical use. The
dataset, training recipe, and pretrained TTS models are freely available at
https://github.com/walker-hyf/MnTTS.
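The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common definitions: RTF as synthesis time divided by the duration of the generated audio, and MOS as the mean of listener ratings on a 1-5 scale, reported here with a normal-approximation 95% confidence interval. The function names and the sample ratings are hypothetical, and the paper's exact measurement protocol is not described on this page.

```python
import math
import statistics
from typing import Sequence, Tuple


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.

    Values below 1 mean synthesis runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def mean_opinion_score(ratings: Sequence[float]) -> Tuple[float, float]:
    """Return (mean rating, half-width of an approximate 95% confidence interval)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    mean = statistics.fmean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    # Normal approximation: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical timing: 3.3 s of compute for 10 s of audio gives RTF 0.33.
    print(f"RTF: {real_time_factor(3.3, 10.0):.2f}")
    # Hypothetical listener ratings on a 1-5 scale.
    mos, ci = mean_opinion_score([4, 5, 4, 4, 5])
    print(f"MOS: {mos:.2f} +/- {ci:.2f}")
```

Under this definition, an RTF of about 0.33 means that roughly 3.3 seconds of computation produce 10 seconds of speech.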
Related papers
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, progress for other languages still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data [15.447206120523356]
BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data.
We show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences.
arXiv Detail & Related papers (2024-02-12T22:21:30Z)
- A Large-scale Dataset for Audio-Language Representation Learning [54.933479346870506]
We present an innovative and automatic audio caption generation pipeline based on a series of public tools or APIs.
We construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.9M audio-text pairs.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset [19.086710703808794]
Mongolian is the official language of the Inner Mongolia Autonomous Region and a representative low-resource language spoken by over 10 million people worldwide.
We make public an open-source multi-speaker Mongolian TTS dataset, named MnTTS2, for the benefit of related researchers.
In this work, we prepare transcriptions on various topics and invite three professional Mongolian announcers to form a three-speaker TTS dataset, in which each announcer records 10 hours of speech in Mongolian, resulting in 30 hours in total.
arXiv Detail & Related papers (2022-12-11T14:55:02Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset [4.542831770689362]
This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide.
The dataset consists of about 91 hours of transcribed audio recordings spoken by two professional speakers.
It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech applications in both academia and industry.
arXiv Detail & Related papers (2021-04-17T05:49:57Z)
- Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z)
- A Sentence Cloze Dataset for Chinese Machine Reading Comprehension [64.07894249743767]
We propose a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC).
The proposed task aims to fill each blank in a passage with the right candidate sentence.
We built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the SC-MRC task.
arXiv Detail & Related papers (2020-04-07T04:09:00Z)