ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus
- URL: http://arxiv.org/abs/2303.00069v1
- Date: Tue, 28 Feb 2023 20:18:59 GMT
- Title: ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus
- Authors: Ajinkya Kulkarni and Atharva Kulkarni and Sara Abedalmonem Mohammad Shatnawi and Hanan Aldarmaki
- Abstract summary: We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic.
The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated.
The final ClArTTS corpus contains about 12 hours of speech from a single male speaker, sampled at 40100 Hz.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: At present, text-to-speech (TTS) systems that are trained with high-quality
transcribed speech data using end-to-end neural models can generate speech that
is intelligible and natural and that closely resembles human speech. These models
are trained on relatively large amounts of single-speaker, professionally recorded
audio, typically extracted from audiobooks. Meanwhile, due to the scarcity of
freely available speech corpora of this kind, a large gap exists in Arabic TTS
research and development. Most of the existing freely available Arabic speech
corpora are not suitable for TTS training as they contain multi-speaker casual
speech with variations in recording conditions and quality, whereas the corpora
curated for speech synthesis are generally small in size and not suitable for
training state-of-the-art end-to-end models. As a step towards filling this gap
in resources, we present a speech corpus for Classical Arabic Text-to-Speech
(ClArTTS) to support the development of end-to-end TTS systems for Arabic. The
speech is extracted from a LibriVox audiobook, which is then processed,
segmented, and manually transcribed and annotated. The final ClArTTS corpus
contains about 12 hours of speech from a single male speaker sampled at 40100
Hz. In this paper, we describe the process of corpus creation and provide
details of corpus statistics and a comparison with existing resources.
Furthermore, we develop two TTS systems based on Grad-TTS and Glow-TTS and
illustrate the performance of the resulting systems via subjective and
objective evaluations. The corpus will be made publicly available at
www.clartts.com for research purposes, along with a demo of the baseline TTS
systems.
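To make the corpus-creation steps above concrete, here is a minimal Python sketch of the kind of preprocessing the abstract describes: splitting a long LibriVox recording on silences into utterance-sized chunks and resampling to the corpus rate of 40100 Hz. This is not the authors' actual pipeline; the input file name and silence thresholds are hypothetical.

```python
# Hedged sketch of audiobook preprocessing for TTS corpus creation.
# Requires pydub (and ffmpeg for mp3 decoding); paths and thresholds
# below are illustrative assumptions, not values from the paper.
from pydub import AudioSegment
from pydub.silence import split_on_silence

TARGET_RATE = 40100  # ClArTTS sampling rate in Hz

audio = AudioSegment.from_file("librivox_chapter_01.mp3")  # hypothetical input
chunks = split_on_silence(
    audio,
    min_silence_len=500,             # ms of silence that ends an utterance (assumed)
    silence_thresh=audio.dBFS - 16,  # silence level relative to average loudness (assumed)
    keep_silence=150,                # keep a short pad around each chunk, in ms
)

for i, chunk in enumerate(chunks):
    # Mono, resampled to the target rate, one file per candidate utterance;
    # each file would then be manually transcribed and annotated.
    chunk.set_channels(1).set_frame_rate(TARGET_RATE).export(
        f"clartts_utt_{i:05d}.wav", format="wav"
    )
```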
Related papers
- LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning [12.069474749489897]
We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics.
Results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset.
arXiv Detail & Related papers (2024-06-12T07:49:21Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors (see the RVQ sketch after this list).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text to speech synthesis (TTS).
Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale, text-labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech data for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
- Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module [16.369219400819134]
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech.
When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations.
We propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker.
arXiv Detail & Related papers (2022-02-16T16:12:21Z)
- A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture [0.0]
Existing Arabic speech synthesis solutions are slow, of low quality, and the naturalness of the synthesized speech is inferior to that of English synthesizers.
This work describes how to generate high-quality, natural, human-like Arabic speech using an end-to-end neural deep network architecture.
arXiv Detail & Related papers (2020-07-22T17:03:18Z)
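The NaturalSpeech 2 and VALL-E entries above both rest on residual vector quantization (RVQ) inside a neural audio codec: a stack of quantizers in which each stage codes the residual left by the previous one, yielding the discrete codes that a model like VALL-E treats as language-model tokens. Below is a minimal NumPy sketch of the RVQ mechanism with random codebooks standing in for learned ones; it is illustrative only, not either paper's implementation.

```python
# Toy residual vector quantization: random codebooks stand in for the
# learned codebooks of a real neural audio codec.
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_quantizers = 8, 16, 4
codebooks = rng.normal(size=(n_quantizers, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Return one code index per quantizer stage for vector x."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
        codes.append(idx)
        residual = residual - cb[idx]  # the next stage quantizes what is left
    return codes

def rvq_decode(codes, codebooks):
    """Sum the chosen code vectors across stages to reconstruct x."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
err = np.linalg.norm(x - rvq_decode(codes, codebooks))
print("codes:", codes, "reconstruction error:", err)
```

Adding quantizer stages drives the reconstruction error down, which is why a handful of codebooks suffice to represent audio frames as short sequences of discrete tokens.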