Related papers: Towards Zero-Shot Text-To-Speech for Arabic Dialects

URL: http://arxiv.org/abs/2406.16751v3
Date: Sun, 7 Jul 2024 15:27:26 GMT
Title: Towards Zero-Shot Text-To-Speech for Arabic Dialects
Authors: Khai Duy Doan, Abdul Waheed, Muhammad Abdul-Mageed
Abstract summary: Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, other languages still lag behind due to insufficient resources. We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis. We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
Score: 16.10882912169842
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, other languages still lag behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS model (https://docs.coqui.ai/en/latest/models/xtts.html; https://medium.com/machine-learns/xtts-v2-new-version-of-the-open-source-text-to-speech-model-af73914db81f; https://medium.com/@erogol/xtts-v1-techincal-notes-eb83ff05bdc), an open-source architecture. We then evaluate our models on a dataset comprising 31 unseen speakers and an in-house dialectal dataset. Our automated and human evaluation results show convincing performance, with the models capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic.

Related papers

LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models [0.0]
We introduce NileTTS: 38 hours of transcribed speech from two speakers across diverse domains. Large language models (LLMs) generate Egyptian Arabic content, which is converted to natural speech using audio synthesis tools. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate against the baseline model trained on other Arabic dialects.
arXiv Detail & Related papers (2026-02-17T15:58:27Z)
ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection [2.5962590697722447]
We introduce the first multi-dialect Arabic spoofed speech dataset. Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus.
arXiv Detail & Related papers (2025-09-26T18:11:20Z)
MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis [0.0]
MahaTTS-v2 is a Multilingual Multi-speaker Text-To-Speech (TTS) system that has excellent multilingual expressive capabilities in Indic languages. Our approach leverages Wav2Vec2.0 tokens for semantic extraction, and a Language Model (LM) for text-to-semantic modeling.
arXiv Detail & Related papers (2025-08-05T20:49:04Z)
AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models. It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z)
ArmanTTS single-speaker Persian dataset [2.7836084563851284]
This paper introduces a single-speaker dataset, ArmanTTS. We show that ArmanTTS meets the necessary standards for training a Persian text-to-speech model. We also combine Tacotron 2 and HiFi-GAN to design a model that receives phonemes as input and outputs the corresponding speech.
arXiv Detail & Related papers (2023-04-07T10:52:55Z)
Towards Building Text-To-Speech Systems for the Next Billion Users [18.290165216270452]
We evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages. We train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores.
arXiv Detail & Related papers (2022-11-17T13:59:34Z)
Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language. We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language. Our pipeline consists of three components: acoustic, pronunciation, and language models. We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language. We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes. With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech. We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.