RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations
- URL: http://arxiv.org/abs/2505.18609v2
- Date: Tue, 27 May 2025 14:19:23 GMT
- Title: RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations
- Authors: Ashwin Sankar, Yoach Lacombe, Sherry Thomas, Praveen Srinivasa Varadhan, Sanchit Gandhi, Mitesh M Khapra
- Abstract summary: We introduce RASMALAI, a large-scale speech dataset with rich text descriptions. We develop IndicParlerTTS, the first open-source, text-description-guided TTS for Indian languages.
- Score: 15.198945496921914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations with fine-grained attributes like speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we develop IndicParlerTTS, the first open-source, text-description-guided TTS for Indian languages. Systematic evaluation demonstrates its ability to generate high-quality speech for named speakers, reliably follow text descriptions and accurately synthesize specified attributes. Additionally, it effectively transfers expressive characteristics both within and across languages. IndicParlerTTS consistently achieves strong performance across these evaluations, setting a new standard for controllable multilingual expressive speech synthesis in Indian languages.
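To make the text-description-guided interface concrete, here is a minimal synthesis sketch in the style of the Parler-TTS API. The checkpoint name and the two-tokenizer setup are assumptions to check against the official IndicParlerTTS release.

```python
# Hedged sketch of description-guided synthesis with IndicParlerTTS.
# Assumes the parler_tts package and the checkpoint "ai4bharat/indic-parler-tts";
# verify both against the official release before relying on this.
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

repo = "ai4bharat/indic-parler-tts"  # assumed checkpoint name
model = ParlerTTSForConditionalGeneration.from_pretrained(repo)
prompt_tok = AutoTokenizer.from_pretrained(repo)
desc_tok = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "नमस्ते, आप कैसे हैं?"  # the Hindi text to be spoken
description = ("A female speaker delivers a slightly expressive and animated "
               "speech with a moderate speed, in a very clear recording.")

# The description controls speaker, accent, emotion, style, and background
# conditions; the prompt carries only the linguistic content.
desc_ids = desc_tok(description, return_tensors="pt").input_ids
prompt_ids = prompt_tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    audio = model.generate(input_ids=desc_ids, prompt_input_ids=prompt_ids)
sf.write("out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```

Editing the description string is the control surface: swapping "female" for "male", or "very clear recording" for "noisy background", is how the fine-grained attributes in RASMALAI's annotations are exercised at inference time.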
Related papers
- NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations [7.55995559331834]
Paralinguistic vocalizations are integral to natural spoken communication. NVSpeech bridges the recognition and synthesis of paralinguistic vocalizations, offering the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin.
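A hedged sketch of what word-level paralinguistic annotation could look like, with vocalizations treated as inline tokens; the tag set and record layout here are illustrative, not the paper's actual schema.

```python
# Hypothetical word-level transcript that inlines paralinguistic
# vocalizations as first-class tokens, in the spirit of NVSpeech.
from dataclasses import dataclass

@dataclass
class WordSpan:
    token: str        # a word, or a vocalization tag such as "[laughter]"
    start_s: float    # start time within the clip, in seconds
    end_s: float      # end time within the clip, in seconds

clip = [
    WordSpan("你好", 0.12, 0.45),
    WordSpan("[laughter]", 0.50, 1.10),  # vocalization aligned like a word
    WordSpan("没想到", 1.15, 1.70),
]

# Treating tags as ordinary vocabulary lets one pipeline both recognize
# (ASR) and synthesize (TTS) the vocalizations.
print(" ".join(w.token for w in clip))
```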
arXiv Detail & Related papers (2025-08-06T08:25:26Z)
- Optimizing Multilingual Text-To-Speech with Accents & Emotions [0.0]
This paper introduces a new TTS architecture that integrates accent modelling and transliteration preservation with multi-scale emotion modelling. Our approach extends the Parler-TTS model with a language-specific phoneme-alignment hybrid encoder-decoder architecture. Tests demonstrate a 23.7% improvement in accent accuracy and 85.3% emotion recognition accuracy from native listeners.
arXiv Detail & Related papers (2025-06-19T13:35:05Z)
- Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages [6.74683227658822]
India has 1369 languages, of which 22 are official, written in 13 scripts. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from different families. Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili, and Kurukh.
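One way to exploit script relatedness, sketched below under stated assumptions: text in a zero-shot language is re-rendered in the script of a related, seen language so an existing frontend can handle it. This uses the indic_transliteration package; the paper's actual pipeline may differ.

```python
# Sanskrit text in Devanagari, mapped to Kannada script so that a
# Kannada-trained voice could attempt it. Illustrative only.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

sa_text = "धर्मक्षेत्रे कुरुक्षेत्रे"
kn_text = transliterate(sa_text, sanscript.DEVANAGARI, sanscript.KANNADA)
print(kn_text)
```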
arXiv Detail & Related papers (2025-06-04T12:22:24Z)
- CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training [70.10844677737886]
We present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild. Key features of CosyVoice 3 include a novel speech tokenizer that improves prosody naturalness. Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects.
arXiv Detail & Related papers (2025-05-23T07:55:21Z)
- BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages [27.273651323572786]
We evaluate the performance of widely-used Automatic Speech Translation (AST) systems on Indian languages.
There is a striking absence of systems capable of accurately translating colloquial and informal language.
We introduce BhasaAnuvaad, the largest publicly available dataset for AST involving 13 out of 22 scheduled Indian languages and English.
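A brief sketch of how one might stream a subset of the dataset with the Hugging Face datasets library; the repository id and config name are assumptions, so consult the dataset card for the actual configs and fields.

```python
# Assumed repository id and config name; verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("ai4bharat/BhasaAnuvaad", "hindi",
                  split="train", streaming=True)
for example in ds.take(2):
    print(example.keys())  # expect audio plus transcript/translation fields
```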
arXiv Detail & Related papers (2024-11-07T13:33:34Z)
- SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description [19.064845530513285]
We propose an automatic speech annotation system that annotates in-the-wild speech clips with expressive and vivid human language descriptions.
Our system provides in-depth understandings of speech style through tailored natural language descriptions.
The resulting dataset is distinguished by its highly descriptive natural language style prompts, containing approximately 2,000 hours of audio data and encompassing over two million speech clips.
arXiv Detail & Related papers (2024-08-24T15:36:08Z)
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture CoSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
We propose the Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT).
The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
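An illustrative serialization of one dialogue turn in the spirit of this abstract: text context, a speech-embedding slot, and a sentiment label arranged as a single input sequence. All special tokens below are invented for the sketch, not ParalinGPT's actual vocabulary.

```python
# Hypothetical serialization; token names are placeholders.
def serialize_turn(history: list[str], sentiment: str, response: str) -> str:
    context = " ".join(f"<turn> {utt}" for utt in history)
    # In a serialized multitasking setup, the model would first predict the
    # sentiment, then the response, conditioned on the context and the
    # speech embeddings injected at the <speech> slot.
    return f"{context} <speech> <sentiment> {sentiment} <response> {response}"

print(serialize_turn(["How was the trip?"], "positive", "It was wonderful!"))
```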
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
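A minimal sketch of the "speech units as pseudo-text" idea: self-supervised features (HuBERT here) are quantized with k-means, and the resulting cluster ids are treated like tokens. The model choice and clustering setup are assumptions, not the paper's exact configuration.

```python
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
hubert.eval()

def speech_to_units(waveform: torch.Tensor, kmeans: KMeans) -> list[int]:
    """Map a 16 kHz mono waveform of shape (1, T) to a unit-id sequence.

    `kmeans` must already be fitted on HuBERT features from a corpus.
    """
    with torch.no_grad():
        feats = hubert(waveform).last_hidden_state.squeeze(0)  # (frames, 768)
    units = kmeans.predict(feats.numpy())
    # Collapse consecutive duplicates, as is common for unit sequences.
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
```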
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
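An illustrative dispatch of training objectives by data type, mirroring the regimes the abstract lists; all method names below are invented for the sketch, and Virtuoso's actual objectives differ in detail.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Batch:
    text: Optional[Any] = None    # transcripts or unspoken text
    speech: Optional[Any] = None  # paired or untranscribed audio

def training_loss(batch: Batch, model) -> Any:
    if batch.text is not None and batch.speech is not None:
        return model.supervised_loss(batch.text, batch.speech)  # TTS/ASR pairs
    if batch.speech is not None:
        return model.speech_ssl_loss(batch.speech)   # untranscribed speech
    return model.text_ssl_loss(batch.text)           # unspoken text
```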
arXiv Detail & Related papers (2022-10-27T14:09:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.