Byakto Speech: Real-time long speech synthesis with convolutional neural network: Transfer learning from English to Bangla
- URL: http://arxiv.org/abs/2106.03937v1
- Date: Mon, 31 May 2021 20:39:35 GMT
- Title: Byakto Speech: Real-time long speech synthesis with convolutional neural network: Transfer learning from English to Bangla
- Authors: Zabir Al Nazi, Sayed Mohammed Tasmimul Huda
- Abstract summary: Byakta is the first-ever open-source deep learning-based bilingual (Bangla and English) text-to-speech synthesis system.
A speech recognition model-based automated scoring metric was also proposed to evaluate the performance of a TTS model.
We introduce a test benchmark dataset for evaluating the speech quality of Bangla speech synthesis models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech synthesis is one of the more challenging tasks to automate
with deep learning, and because Bangla is a low-resource language there have
been very few attempts at Bangla speech synthesis. Most existing systems
cannot handle anything beyond simple Bangla character scripts and very short
sentences. This work attempts to solve these problems by introducing Byakta,
the first-ever open-source deep learning-based bilingual (Bangla and English)
text-to-speech synthesis system. A speech recognition model-based automated
scoring metric is also proposed to evaluate the performance of a TTS model. We
also introduce a test benchmark dataset for evaluating the speech quality of
Bangla speech synthesis models. The TTS is available at
https://github.com/zabir-nabil/bangla-tts
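The proposed ASR-based scoring metric suggests a simple round-trip recipe: synthesize a sentence, transcribe the audio with a speech recognition model, and measure how much of the input text survives. A minimal Python sketch of that idea (the `tts_synthesize` and `asr_transcribe` callables and the choice of character error rate are illustrative assumptions, not the paper's exact procedure):

```python
# Sketch of an ASR-based TTS score: synthesize text, transcribe it back,
# and score intelligibility by character error rate (CER).

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def asr_based_score(text: str, tts_synthesize, asr_transcribe) -> float:
    """Lower CER means the TTS output was more intelligible to the ASR model.

    `tts_synthesize` and `asr_transcribe` are placeholders for any
    text-to-waveform and waveform-to-text callables.
    """
    waveform = tts_synthesize(text)
    hypothesis = asr_transcribe(waveform)
    return levenshtein(text, hypothesis) / max(len(text), 1)
```

Averaging this score over the benchmark sentences would give an automatic, listener-free proxy for intelligibility.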
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach using low-rank adaptation (LoRA) of the large language model (LLM) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
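As background for the LoRA-based SFT mentioned above, attaching low-rank adapters to an LLM backbone typically looks like the following sketch with Hugging Face `peft`; the backbone, target modules, and hyperparameters are placeholder assumptions, and the paper's speech front end is omitted.

```python
# Hedged sketch of LoRA fine-tuning on an LLM backbone, not the paper's setup.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

backbone = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder backbone
lora_cfg = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the update
    target_modules=["c_attn"],  # attention projections to adapt (GPT-2 naming)
    lora_dropout=0.05,
)
model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

Because only the adapter weights receive gradients, the frozen backbone keeps its text abilities while the SFT data teaches the speech tasks.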
- Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning [6.544954579068865]
We propose a transfer learning approach using high-resource language data and synthetically generated data.
We employ a three-step approach to train a high-quality single-speaker TTS system in Hindi, a low-resource Indian language.
arXiv Detail & Related papers (2023-12-02T10:52:00Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
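The residual vector quantizers mentioned in the NaturalSpeech 2 summary quantize a latent vector in stages, each codebook encoding the residual the previous stage left behind. A minimal NumPy sketch under assumed, arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
num_stages, codebook_size, dim = 4, 256, 64
codebooks = rng.normal(size=(num_stages, codebook_size, dim))

def rvq_encode(latent, codebooks):
    """Return one code index per stage; each stage quantizes the residual."""
    residual, codes = latent.copy(), []
    for book in codebooks:
        idx = np.argmin(((book - residual) ** 2).sum(axis=1))  # nearest entry
        codes.append(int(idx))
        residual -= book[idx]  # next stage sees what this stage missed
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected entries to reconstruct the quantized latent."""
    return sum(book[idx] for book, idx in zip(codebooks, codes))

latent = rng.normal(size=dim)
codes = rvq_encode(latent, codebooks)
reconstruction = rvq_decode(codes, codebooks)
```

Each added stage refines the reconstruction, which is why stacking a few small codebooks can match one impractically large one.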
- ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus [3.1925030748447747]
We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic.
The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated.
The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40.1 kHz.
arXiv Detail & Related papers (2023-02-28T20:18:59Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
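The VALL-E recipe above, a language model over discrete codec codes, reduces to ordinary next-token prediction once the audio is tokenized. A toy PyTorch sketch (the random "codes", model sizes, and the absence of text conditioning are simplifying assumptions):

```python
import torch
import torch.nn as nn

vocab, dim, seq_len = 1024, 256, 128  # codec codebook size, width, length

# Toy autoregressive model over discrete codec codes.
embed = nn.Embedding(vocab, dim)
layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
lm = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(dim, vocab)

codes = torch.randint(0, vocab, (1, seq_len))     # stand-in for codec codes
causal = nn.Transformer.generate_square_subsequent_mask(seq_len - 1)

hidden = lm(embed(codes[:, :-1]), mask=causal)    # predict the next code
logits = head(hidden)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab), codes[:, 1:].reshape(-1)
)
```

In a real system the codes would come from a trained neural codec and the model would be conditioned on the input text and an enrollment prompt, which is where the zero-shot personalization comes from.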
- Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition [60.84668086976436]
An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language.
This paper proposes an unsupervised TTS system by leveraging recent advances in unsupervised automatic speech recognition (ASR).
Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each.
arXiv Detail & Related papers (2022-03-29T17:57:53Z)
- Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning [0.802904964931021]
The proposed approach aims to overcome the limitations of single-speaker synthesis by obtaining a system able to model a multi-speaker acoustic space.
This allows the generation of speech audio similar to the voice of different target speakers, even if they were not observed during the training phase.
arXiv Detail & Related papers (2021-02-10T18:43:56Z)
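Zero-shot multi-speaker synthesis of this kind is commonly implemented by conditioning the synthesizer on a speaker embedding computed from a short reference clip; assuming that standard mechanism (the paper's exact design may differ), a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

text_dim, spk_dim, hid = 256, 64, 256

# Toy conditioning: broadcast a per-utterance speaker embedding across the
# text-encoder timesteps and fuse it with the text features.
fuse = nn.Linear(text_dim + spk_dim, hid)

text_feats = torch.randn(1, 50, text_dim)   # encoder output, 50 timesteps
spk_embed = torch.randn(1, spk_dim)         # e.g., from a speaker encoder

spk_tiled = spk_embed.unsqueeze(1).expand(-1, text_feats.size(1), -1)
conditioned = torch.tanh(fuse(torch.cat([text_feats, spk_tiled], dim=-1)))
# `conditioned` would feed the decoder, steering it toward the target voice.
```

Because the embedding is computed by a separate speaker encoder, voices unseen during training can still be cloned from a reference recording.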
- A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture [0.0]
Existing Arabic speech synthesis solutions are slow, of low quality, and the naturalness of their synthesized speech is inferior to that of English synthesizers.
This work describes how to generate high quality, natural, and human-like Arabic speech using an end-to-end neural deep network architecture.
arXiv Detail & Related papers (2020-07-22T17:03:18Z)
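Transfer learning for TTS, the thread running from this entry back to Byakta's English-to-Bangla setup, usually means initializing from a high-resource checkpoint and retraining only what the new language requires. A hedged PyTorch sketch (the tiny model, checkpoint name, and the choice of which layers to retrain are all illustrative assumptions):

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Stand-in for a sequence-to-spectrogram TTS network."""
    def __init__(self, vocab_size, dim=256, n_mels=80):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, dim)
        self.body = nn.GRU(dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, chars):
        h, _ = self.body(self.char_embed(chars))
        return self.to_mel(h)

model = TinyTTS(vocab_size=70)  # hypothetical English character set
# state = torch.load("english_tts.pt")  # pretrained high-resource weights
# model.load_state_dict(state)

# Swap in a fresh embedding sized for the Bangla + English character set,
# keeping the pretrained body as the transferred part.
model.char_embed = nn.Embedding(140, 256)  # hypothetical bilingual vocab
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("char_embed") or name.startswith("to_mel")
```

Freezing the recurrent body at first and training only the new embedding (and output projection) is one common warm-up strategy before unfreezing everything for a short full fine-tune.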