SERENGETI: Massively Multilingual Language Models for Africa
- URL: http://arxiv.org/abs/2212.10785v2
- Date: Fri, 26 May 2023 20:20:04 GMT
- Title: SERENGETI: Massively Multilingual Language Models for Africa
- Authors: Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Alcides
Alcoba Inciarte
- Abstract summary: We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages.
- Score: 5.945320097465418
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual pretrained language models (mPLMs) acquire valuable,
generalizable linguistic information during pretraining and have advanced the
state of the art on task-specific finetuning. To date, only ~31 out of ~2,000
African languages are covered in existing language models. We ameliorate this
limitation by developing SERENGETI, a massively multilingual language model
that covers 517 African languages and language varieties. We evaluate our novel
models on eight natural language understanding tasks across 20 datasets,
comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms
other models on 11 datasets across the eights tasks, achieving 82.27 average
F_1. We also perform analyses of errors from our models, which allows us to
investigate the influence of language genealogy and linguistic similarity when
the models are applied under zero-shot settings. We will publicly release our
models for
research.\footnote{\href{https://github.com/UBC-NLP/serengeti}{https://github.com/UBC-NLP/serengeti}}
Related papers
- EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation [24.060772057458685]
This paper introduces EthioLLM -- multilingual large language models for five Ethiopian languages (Amharic, Ge'ez, Afan Oromo, Somali, and Tigrinya) and English.
We evaluate the performance of these models across five downstream Natural Language Processing (NLP) tasks.
arXiv Detail & Related papers (2024-03-20T16:43:42Z) - Aya Model: An Instruction Finetuned Open-Access Multilingual Language
Model [33.87586041774359]
Aya is a massively multilingual generative language model that follows instructions in 101 languages of which over 50% are considered as lower-resourced.
We introduce extensive new evaluation suites that broaden the state-of-art for multilingual eval across 99 languages.
We conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models.
arXiv Detail & Related papers (2024-02-12T17:34:13Z) - Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages [3.9018931027384056]
We present "Paramanu", a family of novel language models (LM) for Indian languages.
It covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts.
The models are pretrained on a single GPU with context size of 1024 and vary in size from 13.29 million (M) to 367.5 M parameters.
arXiv Detail & Related papers (2024-01-31T17:58:10Z) - FinGPT: Large Generative Models for a Small Language [48.46240937758779]
We create large language models (LLMs) for Finnish, a language spoken by less than 0.1% of the world population.
We train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT.
We continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
arXiv Detail & Related papers (2023-11-03T08:05:04Z) - Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLMs) trained on 640 billion (B) tokens, avaliable in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - IndicSUPERB: A Speech Processing Universal Performance Benchmark for
Indian languages [16.121708272597154]
We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages.
We train and evaluate different self-supervised models alongside a commonly used baseline benchmark.
We show that language-specific fine-tuned models are more accurate than baseline on most of the tasks.
arXiv Detail & Related papers (2022-08-24T20:14:52Z) - Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating languagespecific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.