IMaSC -- ICFOSS Malayalam Speech Corpus
- URL: http://arxiv.org/abs/2211.12796v1
- Date: Wed, 23 Nov 2022 09:21:01 GMT
- Title: IMaSC -- ICFOSS Malayalam Speech Corpus
- Authors: Deepa P Gopinath, Thennal D K, Vrinda V Nair, Swaraj K S, Sachin G
- Abstract summary: We present IMaSC, a Malayalam text and speech corpus containing approximately 50 hours of recorded speech.
With 8 speakers and a total of 34,473 text-audio pairs, IMaSC is larger than every other publicly available alternative.
We show that our models perform significantly better in terms of naturalness compared to previous studies and publicly available models, with an average mean opinion score of 4.50.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Modern text-to-speech (TTS) systems use deep learning to synthesize speech
increasingly approaching human quality, but they require a database of high
quality audio-text sentence pairs for training. Malayalam, the official
language of the Indian state of Kerala and spoken by 35+ million people, is a
low resource language in terms of available corpora for TTS systems. In this
paper, we present IMaSC, a Malayalam text and speech corpus containing
approximately 50 hours of recorded speech. With 8 speakers and a total of
34,473 text-audio pairs, IMaSC is larger than every other publicly available
alternative. We evaluated the database by using it to train TTS models for each
speaker based on a modern deep learning architecture. Via subjective
evaluation, we show that our models perform significantly better in terms of
naturalness compared to previous studies and publicly available models, with an
average mean opinion score of 4.50, indicating that the synthesized speech is
close to human quality.
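As a rough illustration of how a mean opinion score (MOS) like the 4.50 reported above is computed, the sketch below averages per-utterance listener ratings on the usual 1-5 naturalness scale. The ratings here are hypothetical placeholders, not data from the IMaSC evaluation.

```python
# Hypothetical listener ratings (1-5 naturalness scale) for three
# synthesized utterances; illustrative numbers only.
ratings = {
    "utterance_1": [5, 4, 5, 4],
    "utterance_2": [4, 5, 5, 4],
    "utterance_3": [5, 5, 4, 5],
}

# MOS per utterance: arithmetic mean of that utterance's ratings.
per_utterance_mos = {
    utt: sum(scores) / len(scores) for utt, scores in ratings.items()
}

# Overall MOS: mean of the per-utterance scores.
overall_mos = sum(per_utterance_mos.values()) / len(per_utterance_mos)
print(round(overall_mos, 2))
```

In practice, MOS studies also report confidence intervals over listeners, but the headline figure is this simple average.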
Related papers
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z) - ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z) - ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus [3.1925030748447747]
We present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic.
The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated.
The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40,100 Hz.
arXiv Detail & Related papers (2023-02-28T20:18:59Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Towards Building Text-To-Speech Systems for the Next Billion Users [18.290165216270452]
We evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages.
We train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores.
arXiv Detail & Related papers (2022-11-17T13:59:34Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis [3.6406488220483317]
RyanSpeech is a new speech corpus for research on automated text-to-speech (TTS) systems.
It contains over 10 hours of a professional male voice actor's speech recorded at 44.1 kHz.
arXiv Detail & Related papers (2021-06-15T22:24:38Z) - Byakto Speech: Real-time long speech synthesis with convolutional neural network: Transfer learning from English to Bangla [0.0]
Byakta is the first open-source deep learning-based bilingual (Bangla and English) text-to-speech synthesis system.
A speech recognition model-based automated scoring metric was also proposed to evaluate the performance of a TTS model.
We introduce a test benchmark dataset for Bangla speech synthesis models for evaluating speech quality.
arXiv Detail & Related papers (2021-05-31T20:39:35Z) - KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset [4.542831770689362]
This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide.
The dataset consists of about 91 hours of transcribed audio recordings spoken by two professional speakers.
It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech applications in both academia and industry.
arXiv Detail & Related papers (2021-04-17T05:49:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.