IndicSUPERB: A Speech Processing Universal Performance Benchmark for
Indian languages
- URL: http://arxiv.org/abs/2208.11761v1
- Date: Wed, 24 Aug 2022 20:14:52 GMT
- Title: IndicSUPERB: A Speech Processing Universal Performance Benchmark for
Indian languages
- Authors: Tahir Javed, Kaushal Santosh Bhogale, Abhigyan Raman, Anoop
Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
- Abstract summary: We release the IndicSUPERB benchmark, which covers six speech tasks in 12 Indian languages.
We train and evaluate different self-supervised models alongside the commonly used FBANK baseline.
We show that language-specific fine-tuned models are more accurate than the baseline on most of the tasks.
- Score: 16.121708272597154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A cornerstone in AI research has been the creation and adoption of
standardized training and test datasets to earmark the progress of
state-of-the-art models. A particularly successful example is the GLUE dataset
for training and evaluating Natural Language Understanding (NLU) models for
English. The large body of research around self-supervised BERT-based language
models revolved around performance improvements on NLU tasks in GLUE. To
evaluate language models in other languages, several language-specific GLUE
datasets were created. The area of speech language understanding (SLU) has
followed a similar trajectory. The success of large self-supervised models such
as wav2vec2 enables the creation of speech models with relatively easy-to-access
unlabelled data. These models can then be evaluated on SLU tasks, such as the
SUPERB benchmark. In this work, we extend this to Indic languages by releasing
the IndicSUPERB benchmark. Specifically, we make the following three
contributions. (i) We collect Kathbath containing 1,684 hours of labelled
speech data across 12 Indian languages from 1,218 contributors located in 203
districts in India. (ii) Using Kathbath, we create benchmarks across 6 speech
tasks: Automatic Speech Recognition, Speaker Verification, Speaker
Identification (mono/multi), Language Identification, Query By Example, and
Keyword Spotting for 12 languages. (iii) On the released benchmarks, we train
and evaluate different self-supervised models alongside a commonly used
baseline FBANK. We show that language-specific fine-tuned models are more
accurate than the baseline on most of the tasks, including a large gap of 76% for
the Language Identification task. However, for speaker identification,
self-supervised models trained on large datasets demonstrate an advantage. We
hope IndicSUPERB contributes to the progress of developing speech language
understanding models for Indian languages.
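For readers unfamiliar with the FBANK baseline named in the abstract: FBANK usually refers to log Mel filterbank energies computed from short overlapping frames of audio. The sketch below is a minimal pure-Python illustration of that general idea, not the configuration used in the paper. The function names, the Hamming window, the mel formula, and every parameter value (frame length, hop, FFT size, number of filters) are assumptions chosen for a small, fast example.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert Hz to mel using one common formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Build triangular filters whose edges are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    step = (high - low) / (n_filters + 1)
    # Edge positions expressed as (fractional) FFT bin indices.
    edges = [mel_to_hz(low + i * step) * n_fft / sample_rate
             for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_bins):
            if left < k <= center:
                row.append((k - left) / (center - left))    # rising slope
            elif center < k < right:
                row.append((right - k) / (right - center))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def power_spectrum(frame, n_fft):
    """Hamming-window a frame and compute its power spectrum with a naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n_fft)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n_fft)
    return spectrum


def fbank(signal, sample_rate, frame_len, hop, n_fft, n_filters):
    """Return one list of log filterbank energies per frame."""
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = power_spectrum(signal[start:start + frame_len], n_fft)
        # Floor the energy so the log is defined for empty or silent bands.
        feats.append([math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
                      for row in bank])
    return feats


# Toy input: 50 ms of a 1 kHz sine sampled at 8 kHz (illustrative values only).
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(400)]
feats = fbank(signal, sample_rate, frame_len=200, hop=80, n_fft=256, n_filters=10)
print(len(feats), "frames x", len(feats[0]), "filters")
```

In practice a library routine with a fast FFT would replace the naive DFT above. The point of the sketch is only that FBANK features are hand-designed and need no training, which is why they serve as a reference point for learned self-supervised representations.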
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Benchmarking Pre-trained Large Language Models' Potential Across Urdu NLP tasks [0.9786690381850356]
Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research.
This study presents an in-depth examination of prominent LLMs, across 14 tasks using 15 Urdu datasets.
Experiments show that SOTA models surpass all the encoder-decoder pre-trained language models in all Urdu NLP tasks with zero-shot learning.
arXiv Detail & Related papers (2024-05-24T11:30:37Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- SERENGETI: Massively Multilingual Language Models for Africa [5.945320097465418]
We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages.
arXiv Detail & Related papers (2022-12-21T05:54:14Z)
- Towards Building Text-To-Speech Systems for the Next Billion Users [18.290165216270452]
We evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages.
We train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores.
arXiv Detail & Related papers (2022-11-17T13:59:34Z)
- Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages [0.8155575318208631]
Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks.
However, this performance is usually tested and reported on high-resource languages, like English, French, Spanish, and German.
Indian languages, on the other hand, are underrepresented in such benchmarks.
arXiv Detail & Related papers (2020-11-04T14:43:43Z)
- Towards Fully Bilingual Deep Language Modeling [1.3455090151301572]
We consider whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance at either language.
We create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models.
Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks.
arXiv Detail & Related papers (2020-10-22T12:22:50Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.