IndicVoices: Towards building an Inclusive Multilingual Speech Dataset
for Indian Languages
- URL: http://arxiv.org/abs/2403.01926v1
- Date: Mon, 4 Mar 2024 10:42:08 GMT
- Title: IndicVoices: Towards building an Inclusive Multilingual Speech Dataset
for Indian Languages
- Authors: Tahir Javed, Janki Atul Nawale, Eldho Ittan George, Sakshi Joshi,
Kaushal Santosh Bhogale, Deovrat Mehendale, Ishvinder Virender Sethi, Aparna
Ananthanarayanan, Hafsah Faquih, Pratiti Palit, Sneha Ravishankar, Saranya
Sukumaran, Tripura Panchagnula, Sunjay Murali, Kunal Sharad Gandhi,
Ambujavalli R, Manickam K M, C Venkata Vaijayanthi, Krishnan Srinivasa
Raghavan Karunganni, Pratyush Kumar, Mitesh M Khapra
- Abstract summary: INDICVOICES is a dataset of natural and spontaneous speech from 16237 speakers covering 145 Indian districts and 22 languages.
1639 hours have already been transcribed, with a median of 73 hours per language.
All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available.
- Score: 17.862027695142825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present INDICVOICES, a dataset of natural and spontaneous speech
containing a total of 7348 hours of read (9%), extempore (74%) and
conversational (17%) audio from 16237 speakers covering 145 Indian districts
and 22 languages. Of these 7348 hours, 1639 hours have already been
transcribed, with a median of 73 hours per language. Through this paper, we
share our journey of capturing the cultural, linguistic and demographic
diversity of India to create a one-of-its-kind inclusive and representative
dataset. More specifically, we share an open-source blueprint for data
collection at scale comprising standardised protocols, centralised tools, a
repository of engaging questions, prompts and conversation scenarios spanning
multiple domains and topics of interest, quality control mechanisms,
comprehensive transcription guidelines and transcription tools. We hope that
this open-source blueprint will serve as a comprehensive starter kit for data
collection efforts in other multilingual regions of the world. Using
INDICVOICES, we build IndicASR, the first ASR model to support all the 22
languages listed in the 8th schedule of the Constitution of India. All the
data, tools, guidelines, models and other materials developed as a part of this
work will be made publicly available.
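The proportions quoted in the abstract can be turned into approximate hour counts. The following is a minimal sketch, assuming the rounded percentages (9%, 74%, 17%) apply to the 7348-hour total; the variable and function names are illustrative and not taken from the paper, and the resulting figures are estimates rather than numbers reported by the authors.

```python
# Figures quoted in the abstract.
TOTAL_HOURS = 7348
TRANSCRIBED_HOURS = 1639
# Rounded percentage shares per speech type, as stated in the abstract.
SHARE_PERCENT = {"read": 9, "extempore": 74, "conversational": 17}


def approximate_hours(total_hours, share_percent):
    """Estimate hours per speech type from rounded percentage shares."""
    return {
        speech_type: round(total_hours * pct / 100)
        for speech_type, pct in share_percent.items()
    }


breakdown = approximate_hours(TOTAL_HOURS, SHARE_PERCENT)
transcribed_fraction = TRANSCRIBED_HOURS / TOTAL_HOURS

for speech_type, hours in breakdown.items():
    print(f"{speech_type}: ~{hours} hours")
print(f"transcribed so far: {transcribed_fraction:.1%} of all audio")
```

Under these assumptions, roughly 22% of the collected audio has been transcribed so far, and extempore speech accounts for the bulk of the recordings.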
Related papers
- BhasaAnuvaad: A Speech Translation Dataset for 14 Indian Languages [27.273651323572786]
We evaluate the performance of widely-used Automatic Speech Translation systems on Indian languages.
There is a striking absence of systems capable of accurately translating colloquial and informal language.
We introduce BhasaAnuvaad, the largest publicly available dataset for AST involving 14 scheduled Indian languages.
arXiv Detail & Related papers (2024-11-07T13:33:34Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras [1.4699314771635081]
Building speech-based applications for the Indian population is a difficult problem owing to limited data and the number of languages and accents to accommodate.
We are open-sourcing the SPRING-INX data, which has about 2000 hours of legally sourced and manually transcribed speech data for ASR system building in Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi and Tamil.
arXiv Detail & Related papers (2023-10-23T07:50:10Z)
- IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages [37.758476568195256]
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people.
22 of these languages are listed in the Constitution of India (referred to as scheduled languages).
arXiv Detail & Related papers (2023-05-25T17:57:43Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
A main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India [33.31556860332746]
PMIndiaSum is a multilingual and massively parallel summarization corpus focused on languages in India.
Our corpus provides a training and testing ground for four language families and 14 languages, and with 196 language pairs it is the largest of its kind to date.
arXiv Detail & Related papers (2023-05-15T17:41:15Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi [2.84214511742034]
We develop a speech corpus for four low-resource Indo-Aryan languages -- Awadhi, Bhojpuri, Braj and Magahi.
The total size of the corpus currently stands at approximately 18 hours.
We discuss our methodology for data collection in these languages, most of which was done in the middle of the COVID-19 pandemic.
arXiv Detail & Related papers (2022-06-26T17:28:38Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)