Multilingual Text-to-Speech Synthesis for Turkic Languages Using
Transliteration
- URL: http://arxiv.org/abs/2305.15749v1
- Date: Thu, 25 May 2023 05:57:54 GMT
- Authors: Rustem Yeshpanov, Saida Mussakhojayeva, Yerbolat Khassanov
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work aims to build a multilingual text-to-speech (TTS) synthesis system
for ten lower-resourced Turkic languages: Azerbaijani, Bashkir, Kazakh, Kyrgyz,
Sakha, Tatar, Turkish, Turkmen, Uyghur, and Uzbek. We specifically target the
zero-shot learning scenario, where a TTS model trained using the data of one
language is applied to synthesise speech for other, unseen languages. An
end-to-end TTS system based on the Tacotron 2 architecture was trained using
only the available data of the Kazakh language. To generate speech for the
other Turkic languages, we first mapped the letters of the Turkic alphabets
onto the symbols of the International Phonetic Alphabet (IPA), which were then
converted to the Kazakh alphabet letters. To demonstrate the feasibility of the
proposed approach, we evaluated the multilingual Turkic TTS model subjectively
and obtained promising results. To enable replication of the experiments, we
make our code and dataset publicly available in our GitHub repository.
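The two-step letter mapping described in the abstract (Turkic letter → IPA symbol → Kazakh Cyrillic letter) can be sketched as a pair of lookup tables. The fragments below are illustrative assumptions, not the paper's actual grapheme-to-phoneme tables; `TURKISH_TO_IPA`, `IPA_TO_KAZAKH`, and `transliterate` are hypothetical names.

```python
# Sketch of the two-step transliteration: source letter -> IPA -> Kazakh letter.
# These mapping fragments are invented examples, not the paper's tables.
TURKISH_TO_IPA = {"a": "ɑ", "b": "b", "n": "n", "ş": "ʃ", "ı": "ɯ"}
IPA_TO_KAZAKH = {"ɑ": "а", "b": "б", "n": "н", "ʃ": "ш", "ɯ": "ы"}

def transliterate(text: str) -> str:
    """Map each letter to IPA, then to a Kazakh letter; pass unknowns through."""
    out = []
    for ch in text.lower():
        ipa = TURKISH_TO_IPA.get(ch, ch)
        out.append(IPA_TO_KAZAKH.get(ipa, ipa))
    return "".join(out)

print(transliterate("ban"))  # -> "бан"
```

In the zero-shot setting, the Kazakh-trained Tacotron 2 model then synthesises the resulting Kazakh-alphabet string directly.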
Related papers
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; Arabic, however, still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present CoSTA, a new end-to-end model architecture that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
CoSTA outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
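The character error rate (CER) quoted above is edit distance divided by reference length. A minimal sketch using the standard single-row Levenshtein recurrence (an illustration, not the paper's evaluation code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # dp[j] holds the previous row's distances
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # substitution
            prev = cur
    return dp[n] / m

print(cer("speech", "speach"))  # one substitution over six characters
```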
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model that draws on an online dictionary for prior pronunciation knowledge.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
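The idea of dictionary-based polyphone disambiguation can be shown with a toy lookup in which the whole word resolves a polyphonic character. `PRON_DICT` and `pronounce` are hypothetical names and the entries are textbook examples; the actual Dict-TTS model matches dictionary entries semantically rather than by exact lookup.

```python
# Toy illustration: the character 行 is polyphonic ("hang2" vs "xing2"),
# and the containing word disambiguates it. Invented entries for illustration.
PRON_DICT = {
    "银行": "yin2 hang2",  # "bank"
    "行走": "xing2 zou3",  # "to walk"
}

def pronounce(word: str, fallback: str = "?") -> str:
    """Resolve a polyphonic character by looking up the whole word."""
    return PRON_DICT.get(word, fallback)

print(pronounce("银行"))  # yin2 hang2
```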
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics [4.859986264602551]
We present an expanded version of our previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus.
In the new KazakhTTS2 corpus, the overall size is increased from 93 hours to 271 hours.
The number of speakers has risen from two to five (three females and two males), and the topic coverage is diversified with the help of new sources, including a book and Wikipedia articles.
arXiv Detail & Related papers (2022-01-15T06:54:30Z)
- KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset [4.542831770689362]
This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide.
The dataset consists of about 91 hours of transcribed audio recordings spoken by two professional speakers.
It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech applications in both academia and industry.
arXiv Detail & Related papers (2021-04-17T05:49:57Z)
- Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion [28.830575877307176]
It is not easy to obtain a bilingual corpus from a speaker who achieves native-level fluency in both languages.
A Tacotron2-based cross-lingual voice conversion system is employed to generate the Mandarin speaker's English speech and the English speaker's Mandarin speech.
The obtained bilingual data are then augmented with code-switched utterances synthesized using a Transformer model.
arXiv Detail & Related papers (2020-10-16T03:51:00Z)
- Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.