Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR
- URL: http://arxiv.org/abs/2305.15386v2
- Date: Wed, 2 Aug 2023 13:29:31 GMT
- Title: Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR
- Authors: Kaushal Santosh Bhogale, Sai Sundaresan, Abhigyan Raman, Tahir Javed,
Mitesh M. Khapra, Pratyush Kumar
- Abstract summary: We show that IndicWhisper significantly improves over the considered ASR systems on the Vistaar benchmark.
IndicWhisper has the lowest WER on 39 of the 59 benchmarks, with an average reduction of 4.1 WER points.
We open-source all datasets, code and models.
- Score: 14.15737970309719
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Improving ASR systems is necessary to make new LLM-based use-cases accessible
to people across the globe. In this paper, we focus on Indian languages, and
make the case that diverse benchmarks are required to evaluate and improve ASR
systems for Indian languages. To address this, we collate Vistaar as a set of
59 benchmarks across various language and domain combinations, on which we
evaluate 3 publicly available ASR systems and 2 commercial systems. We also
train IndicWhisper models by fine-tuning the Whisper models on publicly
available training datasets across 12 Indian languages totalling 10.7K
hours. We show that IndicWhisper significantly improves over the considered
ASR systems on the Vistaar benchmark. Indeed, IndicWhisper has the lowest WER
on 39 of the 59 benchmarks, with an average reduction of 4.1 WER points. We open-source
all datasets, code and models.
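As a rough illustration of the evaluation loop described above, below is a minimal sketch of scoring a Whisper-style checkpoint on one benchmark split with word error rate (WER). It uses the public openai/whisper-small checkpoint via Hugging Face transformers and the jiwer library; the manifest path and its JSON-lines schema are hypothetical placeholders, not the actual Vistaar/IndicWhisper release format.

```python
# Minimal sketch: transcribe a benchmark split with a Whisper checkpoint and
# compute corpus-level WER. Requires: pip install transformers torch jiwer
# (audio decoding inside the pipeline also needs ffmpeg installed).
import json

import jiwer
from transformers import pipeline

# "openai/whisper-small" is a real public checkpoint; a fine-tuned
# IndicWhisper-style checkpoint could be substituted here if available.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Hypothetical manifest: one JSON object per line with "audio_path" and "text".
references, hypotheses = [], []
with open("benchmark_manifest.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        result = asr(sample["audio_path"])  # returns {"text": "..."}
        references.append(sample["text"])
        hypotheses.append(result["text"])

# Corpus-level word error rate over the whole split.
print(f"WER: {100 * jiwer.wer(references, hypotheses):.2f}")
```

Reported benchmark numbers usually also apply transcript normalization (punctuation, casing, numerals) before scoring; that step is omitted here for brevity.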
Related papers
- LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems [16.143694951047024]
We create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases.
We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor.
We train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin.
arXiv Detail & Related papers (2024-08-21T08:51:00Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present human-annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
- Svarah: Evaluating English ASR Systems on Indian Accents [12.197514367387692]
We create Svarah, a benchmark that contains 9.6 hours of transcribed English audio from 117 speakers across 65 geographic locations throughout India.
Svarah comprises both read speech and spontaneous conversational data, covering various domains, such as history, culture, tourism, etc., ensuring a diverse vocabulary.
We evaluate 6 open source ASR models and 2 commercial ASR systems on Svarah and show that there is clear scope for improvement on Indian accents.
arXiv Detail & Related papers (2023-05-25T06:20:29Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
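The model-reprogramming idea in the entry above keeps a pretrained ASR model frozen and trains only a small input-side transformation for the new language. The PyTorch sketch below illustrates that general pattern under stated assumptions; the module shape, sizes, and residual design are illustrative, not the paper's actual architecture.

```python
# Illustrative sketch of input reprogramming for a frozen ASR model.
# All sizes and the residual design are assumptions for illustration.
import torch
import torch.nn as nn


class InputReprogrammer(nn.Module):
    """Small trainable module applied to features before a frozen ASR model."""

    def __init__(self, feat_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim, time); the residual adds a learned,
        # language-specific correction on top of the original features.
        return feats + self.net(feats)


reprogrammer = InputReprogrammer()
optimizer = torch.optim.AdamW(reprogrammer.parameters(), lr=1e-4)
# Training would pass reprogrammer(feats) into a frozen pretrained ASR model
# (all of its parameters set to requires_grad=False) and backpropagate the
# ASR loss only through the reprogrammer's parameters.
```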
- Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages [15.214673043019395]
We create Shrutilipi, a dataset which contains over 6,400 hours of labelled audio across 12 Indian languages.
On average, Shrutilipi results in a 2.3x increase over publicly available labelled data.
We show that adding Shrutilipi to the training set of Wav2Vec models leads to an average decrease in WER of 5.8% for 7 languages.
arXiv Detail & Related papers (2022-08-26T13:37:45Z)
- MIA 2022 Shared Task: Evaluating Cross-lingual Open-Retrieval Question Answering for 16 Diverse Languages [54.002969723086075]
We evaluate cross-lingual open-retrieval question answering systems in 16 typologically diverse languages.
The best system leveraging iteratively mined diverse negative examples achieves 32.2 F1, outperforming our baseline by 4.5 points.
The second best system uses entity-aware contextualized representations for document retrieval, and achieves significant improvements in Tamil (20.8 F1), whereas most of the other systems yield nearly zero scores.
arXiv Detail & Related papers (2022-07-02T06:54:10Z)
- Towards Building ASR Systems for the Next Billion Users [15.867823754118422]
We make contributions towards building ASR systems for low resource languages from the Indian subcontinent.
First, we curate 17,000 hours of raw speech data for 40 Indian languages.
Using this raw speech data we pretrain several variants of wav2vec style models for 40 Indian languages.
arXiv Detail & Related papers (2021-11-06T19:34:33Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- Arabic Speech Recognition by End-to-End, Modular Systems and Human [56.96327247226586]
We perform comprehensive benchmarking of end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR, the end-to-end systems achieved WERs of 12.5%, 27.5%, and 23.8%, a new performance milestone for the MGB2, MGB3, and MGB5 challenges, respectively.
Our results suggest that human performance on Arabic is still considerably better than that of machines, with an absolute WER gap of 3.6% on average.
arXiv Detail & Related papers (2021-01-21T05:55:29Z)