BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages
- URL: http://arxiv.org/abs/2411.04699v2
- Date: Fri, 08 Nov 2024 14:29:03 GMT
- Title: BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages
- Authors: Sparsh Jain, Ashwin Sankar, Devilal Choudhary, Dhairya Suman, Nikhil Narasimhan, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Mitesh M Khapra, Raj Dabre,
- Abstract summary: We evaluate the performance of widely-used Automatic Speech Translation systems on Indian languages.
There is a striking absence of systems capable of accurately translating colloquial and informal language.
We introduce BhasaAnuvaad, the largest publicly available dataset for AST involving 13 out of 22 scheduled Indian languages and English.
- Score: 27.273651323572786
- License:
- Abstract: Automatic Speech Translation (AST) datasets for Indian languages remain critically scarce, with public resources covering fewer than 10 of the 22 official languages. This scarcity has resulted in AST systems for Indian languages lagging far behind those available for high-resource languages like English. In this paper, we first evaluate the performance of widely-used AST systems on Indian languages, identifying notable performance gaps and challenges. Our findings show that while these systems perform adequately on read speech, they struggle significantly with spontaneous speech, including disfluencies like pauses and hesitations. Additionally, there is a striking absence of systems capable of accurately translating colloquial and informal language, a key aspect of everyday communication. To this end, we introduce BhasaAnuvaad, the largest publicly available dataset for AST involving 13 out of 22 scheduled Indian languages and English spanning over 44,400 hours and 17M text segments. BhasaAnuvaad contains data for English speech to Indic text, as well as Indic speech to English text. This dataset comprises three key categories: (1) Curated datasets from existing resources, (2) Large-scale web mining, and (3) Synthetic data generation. By offering this diverse and expansive dataset, we aim to bridge the resource gap and promote advancements in AST for Indian languages.
Related papers
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.
Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.
We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z) - Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z) - CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z) - Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech.
We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z) - IndicVoices: Towards building an Inclusive Multilingual Speech Dataset
for Indian Languages [17.862027695142825]
INDICVOICES is a dataset of natural and spontaneous speech from 16237 speakers covering 145 Indian districts and 22 languages.
1639 hours have already been transcribed, with a median of 73 hours per language.
All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available.
arXiv Detail & Related papers (2024-03-04T10:42:08Z) - Breaking Language Barriers: A Question Answering Dataset for Hindi and
Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages- Hindi and Marathi.
Despite Hindi being the 3rd most spoken language worldwide, and Marathi being the 11th most spoken language globally, both languages face limited resources for building efficient Question Answering systems.
We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z) - Neural Machine Translation for the Indigenous Languages of the Americas:
An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any.
We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z) - IndicTrans2: Towards High-Quality and Accessible Machine Translation
Models for all 22 Scheduled Indian Languages [37.758476568195256]
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people.
22 of these languages are listed in the Constitution of India (referred to as scheduled languages)
arXiv Detail & Related papers (2023-05-25T17:57:43Z) - ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text
Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z) - PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for
Languages in India [33.31556860332746]
PMIndiaSum is a multilingual and massively parallel summarization corpus focused on languages in India.
Our corpus provides a training and testing ground for four language families, 14 languages, and the largest to date with 196 language pairs.
arXiv Detail & Related papers (2023-05-15T17:41:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.