IndicSUPERB: A Speech Processing Universal Performance Benchmark for
Indian languages
- URL: http://arxiv.org/abs/2208.11761v1
- Date: Wed, 24 Aug 2022 20:14:52 GMT
- Title: IndicSUPERB: A Speech Processing Universal Performance Benchmark for
Indian languages
- Authors: Tahir Javed, Kaushal Santosh Bhogale, Abhigyan Raman, Anoop
Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
- Abstract summary: We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages.
We train and evaluate different self-supervised models alongside a commonly used baseline benchmark.
We show that language-specific fine-tuned models are more accurate than baseline on most of the tasks.
- Score: 16.121708272597154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A cornerstone in AI research has been the creation and adoption of
standardized training and test datasets to earmark the progress of
state-of-the-art models. A particularly successful example is the GLUE dataset
for training and evaluating Natural Language Understanding (NLU) models for
English. The large body of research around self-supervised BERT-based language
models revolved around performance improvements on NLU tasks in GLUE. To
evaluate language models in other languages, several language-specific GLUE
datasets were created. The area of speech language understanding (SLU) has
followed a similar trajectory. The success of large self-supervised models such
as wav2vec2 enable creation of speech models with relatively easy to access
unlabelled data. These models can then be evaluated on SLU tasks, such as the
SUPERB benchmark. In this work, we extend this to Indic languages by releasing
the IndicSUPERB benchmark. Specifically, we make the following three
contributions. (i) We collect Kathbath containing 1,684 hours of labelled
speech data across 12 Indian languages from 1,218 contributors located in 203
districts in India. (ii) Using Kathbath, we create benchmarks across 6 speech
tasks: Automatic Speech Recognition, Speaker Verification, Speaker
Identification (mono/multi), Language Identification, Query By Example, and
Keyword Spotting for 12 languages. (iii) On the released benchmarks, we train
and evaluate different self-supervised models alongside a commonly used
baseline FBANK. We show that language-specific fine-tuned models are more
accurate than baseline on most of the tasks, including a large gap of 76\% for
the Language Identification task. However, for speaker identification,
self-supervised models trained on large datasets demonstrate an advantage. We
hope IndicSUPERB contributes to the progress of developing speech language
understanding models for Indian languages.
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z) - Improving Massively Multilingual ASR With Auxiliary CTC Objectives [40.10307386370194]
We introduce our work on improving performance on FLEURS, a 102-language open ASR benchmark.
We investigate techniques inspired from recent Connectionist Temporal Classification ( CTC) studies to help the model handle the large number of languages.
Our state-of-the-art systems using self-supervised models with the Conformer architecture improve over the results of prior work on FLEURS by a relative 28.4% CER.
arXiv Detail & Related papers (2023-02-24T18:59:51Z) - SERENGETI: Massively Multilingual Language Models for Africa [5.945320097465418]
We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages.
arXiv Detail & Related papers (2022-12-21T05:54:14Z) - Towards Building Text-To-Speech Systems for the Next Billion Users [18.290165216270452]
We evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages.
We train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores.
arXiv Detail & Related papers (2022-11-17T13:59:34Z) - Indic-Transformers: An Analysis of Transformer Language Models for
Indian Languages [0.8155575318208631]
Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks.
However, this performance is usually tested and reported on high-resource languages, like English, French, Spanish, and German.
Indian languages, on the other hand, are underrepresented in such benchmarks.
arXiv Detail & Related papers (2020-11-04T14:43:43Z) - Towards Fully Bilingual Deep Language Modeling [1.3455090151301572]
We consider whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance at either language.
We create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models.
Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks.
arXiv Detail & Related papers (2020-10-22T12:22:50Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.