Towards Leaving No Indic Language Behind: Building Monolingual Corpora,
Benchmark and Models for Indic Languages
- URL: http://arxiv.org/abs/2212.05409v3
- Date: Wed, 24 May 2023 17:05:16 GMT
- Authors: Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal,
Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar
- Abstract summary: We aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes.
Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families.
We create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages.
Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building Natural Language Understanding (NLU) capabilities for Indic
languages, which have a collective speaker base of more than one billion
speakers, is absolutely crucial. In this work, we aim to improve the NLU
capabilities of Indic languages by making contributions along 3 important axes:
(i) monolingual corpora, (ii) NLU test sets, and (iii) multilingual LLMs focusing
on Indic languages. Specifically, we curate the largest monolingual corpora,
IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a
2.3x increase over prior work, while supporting 12 additional languages. Next,
we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse
NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME
contains a total of 105 evaluation sets, of which 52 are new contributions to
the literature. To the best of our knowledge, this is the first effort towards
creating a standard benchmark for Indic languages that aims to test the
multilingual zero-shot capabilities of pretrained language models. Finally, we
train IndicBERT v2, a state-of-the-art model supporting all the languages.
Averaged across languages and tasks, the model achieves an absolute improvement
of 2 points over a strong baseline. The data and models are available at
https://github.com/AI4Bharat/IndicBERT.
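As a quick usage illustration, below is a minimal sketch of loading IndicBERT v2 with the Hugging Face transformers library and pooling its final-layer hidden states into a sentence vector. The checkpoint identifier "ai4bharat/IndicBERTv2-MLM-only" is an assumption based on the repository's naming; consult the GitHub link above for the published model names.

    # Minimal sketch: encode an Indic sentence with IndicBERT v2.
    # The checkpoint id below is an assumption; see the AI4Bharat repository
    # (https://github.com/AI4Bharat/IndicBERT) for the published model names.
    from transformers import AutoModel, AutoTokenizer

    MODEL_ID = "ai4bharat/IndicBERTv2-MLM-only"  # assumed Hub identifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)

    # A Hindi example sentence ("India is a vast country.").
    inputs = tokenizer("भारत एक विशाल देश है।", return_tensors="pt")
    outputs = model(**inputs)

    # Mean-pool the final hidden states into one sentence embedding.
    sentence_embedding = outputs.last_hidden_state.mean(dim=1)
    print(sentence_embedding.shape)  # torch.Size([1, hidden_size])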
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages [12.514648269553104]
IndicGenBench is the largest benchmark for evaluating large language models (LLMs) on generation tasks in Indic languages.
It is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering.
The largest PaLM-2 model performs best on most tasks; however, there is a significant performance gap in all languages compared to English.
arXiv Detail & Related papers (2024-04-25T17:57:36Z)
- Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model [33.87586041774359]
Aya is a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced.
We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages.
We conduct detailed investigations into the optimal finetuning mixture composition, data pruning, and the toxicity, bias, and safety of our models.
arXiv Detail & Related papers (2024-02-12T17:34:13Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach (see the sketch after this list).
We build MWEs one language at a time by starting from the resource-rich source language and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages [37.758476568195256]
India has a rich linguistic landscape, with languages from 4 major language families spoken by over a billion people.
22 of these languages are listed in the Constitution of India and referred to as scheduled languages.
arXiv Detail & Related papers (2023-05-25T17:57:43Z)
- Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages [21.018996007110324]
This dataset includes 41.8 million news articles in 14 different Indic languages (and English).
To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available.
arXiv Detail & Related papers (2023-05-10T03:07:17Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages [34.79533646549939]
We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models using multilingual fine-tuning.
Low-resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning.
arXiv Detail & Related papers (2021-09-22T06:37:39Z)
- SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection [81.85463892070085]
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
arXiv Detail & Related papers (2020-06-20T13:24:14Z)
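For the chain-based multilingual word embedding entry above, the following is a minimal sketch of one plausible realization. It assumes each per-step alignment is an orthogonal Procrustes fit on anchor word pairs shared by adjacent languages (the paper may use a different alignment objective); composing the per-step maps places every language in the resource-rich source space.

    # Hedged sketch of chain-based MWE alignment (not the paper's exact method).
    # Assumption: adjacent languages share anchor word pairs, and each step
    # fits an orthogonal Procrustes map into the predecessor's (already
    # aligned) space.
    import numpy as np

    def procrustes(X, Y):
        """Orthogonal W minimizing ||X @ W - Y||_F, via SVD of X.T @ Y."""
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    def align_chain(embeddings, anchor_pairs):
        """embeddings: list of (n_i, d) arrays ordered source -> ... -> target.
        anchor_pairs[i]: (idx_i, idx_next) index arrays of anchor words shared
        between languages i and i+1. Returns all embeddings mapped into the
        source language's space."""
        aligned = [embeddings[0]]  # the source space is the fixed reference
        for i, (src_idx, nxt_idx) in enumerate(anchor_pairs):
            X = embeddings[i + 1][nxt_idx]  # anchors in the next language
            Y = aligned[i][src_idx]         # same anchors, already in source space
            W = procrustes(X, Y)
            aligned.append(embeddings[i + 1] @ W)
        return aligned

    # Example with random stand-in embeddings for a 3-language chain.
    rng = np.random.default_rng(0)
    embs = [rng.normal(size=(1000, 300)) for _ in range(3)]
    pairs = [(np.arange(100), np.arange(100)), (np.arange(100), np.arange(100))]
    aligned = align_chain(embs, pairs)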