SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab,
IIT Madras
- URL: http://arxiv.org/abs/2310.14654v2
- Date: Tue, 24 Oct 2023 06:03:14 GMT
- Title: SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab,
IIT Madras
- Authors: Nithya R, Malavika S, Jordan F, Arjun Gangwar, Metilda N J, S Umesh,
Rithik Sarab, Akhilesh Kumar Dubey, Govind Divakaran, Samudra Vijaya K,
Suryakanth V Gangashetty
- Abstract summary: Building speech-based applications for the Indian population is a difficult problem owing to limited data and the number of languages and accents to accommodate.
We are open-sourcing the SPRING-INX data, which has about 2000 hours of legally sourced and manually transcribed speech data for ASR system building in Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi and Tamil.
- Score: 1.4699314771635081
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: India is home to a multitude of languages, of which 22 are
recognised by the Indian Constitution as official. Building speech-based
applications for the Indian population is a difficult problem owing to limited
data and the number of languages and accents to accommodate. To encourage the
language technology community to build speech-based applications in Indian
languages, we are open-sourcing the SPRING-INX data, which has about 2000 hours of
legally sourced and manually transcribed speech data for ASR system building in
Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi
and Tamil. This endeavor is by SPRING Lab, Indian Institute of Technology
Madras, and is part of the National Language Translation Mission (NLTM), funded by
the Indian Ministry of Electronics and Information Technology (MeitY),
Government of India. We describe the data collection and data cleaning process
along with the data statistics in this paper.
Related papers
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present a human-annotated named entity corpus of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
- IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages [17.862027695142825]
INDICVOICES is a dataset of natural and spontaneous speech from 16237 speakers covering 145 Indian districts and 22 languages.
1639 hours have already been transcribed, with a median of 73 hours per language.
All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available.
arXiv Detail & Related papers (2024-03-04T10:42:08Z)
- IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages [42.50384290676914]
In this paper, we introduce Neural Information Retrieval resources for 11 widely spoken Indian languages.
These resources include (a) INDIC-MARCO, a multilingual version of the MSMARCO dataset in 11 Indian Languages created using Machine Translation, and (b) Indic-ColBERT, a collection of 11 distinct Monolingual Neural Information Retrieval models.
IndicIRSuite is the first attempt at building large-scale Neural Information Retrieval resources for a large number of Indian languages.
arXiv Detail & Related papers (2023-12-15T03:19:53Z)
- PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India [33.31556860332746]
PMIndiaSum is a multilingual and massively parallel summarization corpus focused on languages in India.
Our corpus provides a training and testing ground covering four language families, 14 languages, and 196 language pairs, the largest number to date.
arXiv Detail & Related papers (2023-05-15T17:41:15Z)
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- Factorization of Fact-Checks for Low Resource Indian Languages [44.94080515860928]
We introduce FactDRIL: the first large-scale multilingual fact-checking dataset for regional Indian languages.
Our dataset consists of 9,058 samples in English and 5,155 samples in Hindi, with the remaining 8,222 samples distributed across various regional languages.
We expect this dataset will be a valuable resource and serve as a starting point to fight the proliferation of fake news in low-resource languages.
arXiv Detail & Related papers (2021-02-23T16:47:41Z)
- Taxonomic survey of Hindi Language NLP systems [0.0]
Natural language processing (NLP) is the automatic handling of natural human language by machines.
This survey reports on the resources and applications available for Hindi-language NLP.
arXiv Detail & Related papers (2021-01-30T11:53:56Z)