A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture
- URL: http://arxiv.org/abs/2007.11541v1
- Date: Wed, 22 Jul 2020 17:03:18 GMT
- Title: A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture
- Authors: Fady Fahmy, Mahmoud Khalil, Hazem Abbas
- Abstract summary: Existing Arabic speech synthesis solutions are slow, of low quality, and the naturalness of the synthesized speech is inferior to that of English synthesizers.
This work describes how to generate high quality, natural, and human-like Arabic speech using an end-to-end neural deep network architecture.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech synthesis is the artificial production of human speech. A typical
text-to-speech (TTS) system converts a language text into a waveform. There exist
many mature English TTS systems that produce natural and human-like speech.
In contrast, other languages, including Arabic, have not been considered until
recently. Existing Arabic speech synthesis solutions are slow, of low quality,
and the naturalness of the synthesized speech is inferior to that of English
synthesizers. They also lack essential speech factors such as intonation,
stress, and rhythm. Different works were proposed to solve those issues,
including concatenative methods such as unit selection, as well as parametric
methods. However, these required a lot of laborious work and domain expertise.
Another reason for the poor performance of Arabic speech synthesizers is the
lack of speech corpora, unlike English, which has many publicly available
corpora and audiobooks. This work describes how to generate high-quality,
natural, and human-like Arabic speech using an end-to-end neural deep network
architecture. It uses just $\langle$text, audio$\rangle$ pairs with a
relatively small amount of recorded audio, totaling 2.41 hours. It illustrates
how to reuse an English character embedding despite taking diacritized Arabic
characters as input, and how to preprocess these audio samples to achieve the
best results.
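To make the transfer-learning recipe concrete, here is a minimal sketch, not the authors' code: it replaces the character embedding of an English-pretrained Tacotron-style model with one covering diacritized Arabic, loads the remaining pretrained weights, and preprocesses the audio side of the $\langle$text, audio$\rangle$ pairs. The model class, the checkpoint name tacotron2_english.pt, and the symbol inventory are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import librosa

# Diacritized Arabic symbol set (assumed inventory, for illustration only).
ARABIC_LETTERS = list("ءآأؤإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىي")
DIACRITICS = list("ًٌٍَُِّْ")  # tanween, fatha, damma, kasra, shadda, sukun
SYMBOLS = ["<pad>", "<eos>", " "] + ARABIC_LETTERS + DIACRITICS
CHAR_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

def text_to_ids(text: str) -> torch.Tensor:
    """Map a diacritized Arabic string to character-embedding indices."""
    return torch.tensor([CHAR_TO_ID[c] for c in text if c in CHAR_TO_ID])

class Tacotron(nn.Module):
    """Hypothetical Tacotron-style model; only the embedding layer is shown."""
    def __init__(self, n_symbols: int, embed_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, embed_dim)
        # ... encoder / attention / decoder omitted ...

model = Tacotron(n_symbols=len(SYMBOLS))

# Transfer learning: load the English-pretrained weights but skip the English
# character embedding, since the Arabic symbol inventory differs in size.
pretrained = torch.load("tacotron2_english.pt", map_location="cpu")  # assumed file
state = {k: v for k, v in pretrained.items() if not k.startswith("embedding")}
model.load_state_dict(state, strict=False)

def preprocess(path: str, sr: int = 22050, top_db: int = 40):
    """Resample, trim leading/trailing silence, compute an 80-band mel spectrogram."""
    wav, _ = librosa.load(path, sr=sr)                  # resample to a common rate
    wav, _ = librosa.effects.trim(wav, top_db=top_db)   # strip surrounding silence
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)
    return librosa.power_to_db(mel)
```

Fine-tuning would then proceed on the 2.41 hours of Arabic $\langle$text, audio$\rangle$ pairs with the usual spectrogram-prediction loss; the specific trimming and sample-rate values above are illustrative defaults, not the paper's reported settings.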
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection, speech recognition, and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage [7.096838107088313]
DisfluencySpeech is a studio-quality labeled English speech dataset with paralanguage.
A single speaker recreates nearly 10 hours of expressive utterances from the Switchboard-1 Telephone Speech Corpus (Switchboard).
arXiv Detail & Related papers (2024-06-13T05:23:22Z) - HierSpeech++: Bridging the Gap between Semantic and Acoustic
Representation of Speech by Hierarchical Variational Inference for Zero-shot
Speech Synthesis [39.892633589217326]
Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC).
arXiv Detail & Related papers (2023-11-21T09:07:11Z) - Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph
Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus [3.1925030748447747]
We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic.
The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated.
The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40,100 Hz.
arXiv Detail & Related papers (2023-02-28T20:18:59Z) - Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z) - Byakto Speech: Real-time long speech synthesis with convolutional neural
network: Transfer learning from English to Bangla [0.0]
Byakta is the first-ever open-source deep learning-based bilingual (Bangla and English) text-to-speech synthesis system.
A speech recognition model-based automated scoring metric was also proposed to evaluate the performance of a TTS model.
We introduce a test benchmark dataset for Bangla speech synthesis models for evaluating speech quality.
arXiv Detail & Related papers (2021-05-31T20:39:35Z)