AfroLM: A Self-Active Learning-based Multilingual Pretrained Language
Model for 23 African Languages
- URL: http://arxiv.org/abs/2211.03263v1
- Date: Mon, 7 Nov 2022 02:15:25 GMT
- Title: AfroLM: A Self-Active Learning-based Multilingual Pretrained Language
Model for 23 African Languages
- Authors: Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey
Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, Chris
Chinenye Emezue
- Abstract summary: We present AfroLM, a multilingual language model pretrained from scratch on 23 African languages.
AfroLM is pretrained on a dataset 14x smaller than existing baselines.
It is able to generalize well across various domains.
- Score: 0.021987601456703476
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, multilingual pre-trained language models have gained
prominence due to their remarkable performance on numerous downstream Natural
Language Processing tasks (NLP). However, pre-training these large multilingual
language models requires a lot of training data, which is not available for
African Languages. Active learning is a semi-supervised learning algorithm, in
which a model consistently and dynamically learns to identify the most
beneficial samples to train itself on, in order to achieve better optimization
and performance on downstream tasks. Furthermore, active learning effectively
and practically addresses real-world data scarcity. Despite all its benefits,
active learning, in the context of NLP and especially multilingual language
models pretraining, has received little consideration. In this paper, we
present AfroLM, a multilingual language model pretrained from scratch on 23
African languages (the largest effort to date) using our novel self-active
learning framework. Pretrained on a dataset significantly (14x) smaller than
existing baselines, AfroLM outperforms many multilingual pretrained language
models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text
classification, and sentiment analysis). Additional out-of-domain sentiment
analysis experiments show that \textbf{AfroLM} is able to generalize well
across various domains. We release the code source, and our datasets used in
our framework at https://github.com/bonaventuredossou/MLM_AL.
Related papers
- Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language [7.289015788793582]
This work focuses on increasing technological participation for the S'ami language.
We draw the attention of the ML community towards the language modeling problem of Ultra Low Resource (ULR) languages.
We have compiled the available S'ami language resources from the web to create a clean dataset for training language models.
arXiv Detail & Related papers (2024-05-09T13:54:22Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLMs) trained on 640 billion (B) tokens, avaliable in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - Improving Language Plasticity via Pretraining with Active Forgetting [63.36484652568976]
We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages.
Experiments with RoBERTa show that models pretrained with our forgetting mechanism demonstrate faster convergence during language adaptation.
arXiv Detail & Related papers (2023-07-03T17:12:44Z) - DN at SemEval-2023 Task 12: Low-Resource Language Text Classification
via Multilingual Pretrained Language Model Fine-tuning [0.0]
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, where we employed different multilingual XLM-R models with classification head trained on various data.
arXiv Detail & Related papers (2023-05-04T07:28:45Z) - Generalizing Multimodal Pre-training into Multilingual via Language
Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a textbfMultitextbfLingual textbfAcquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual.
arXiv Detail & Related papers (2022-05-29T08:53:22Z) - A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models
for African News Translation [25.05948665615943]
We create a new African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset.
We show that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.
arXiv Detail & Related papers (2022-05-04T12:11:47Z) - Multilingual Language Model Adaptive Fine-Tuning: A Study on African
Languages [19.067718464786463]
We perform multilingual adaptive fine-tuning (MAFT) on 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent.
To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that corresponds to non-African writing scripts before MAFT.
Our approach is competitive to applying LAFT on individual languages while requiring significantly less disk space.
arXiv Detail & Related papers (2022-04-13T16:13:49Z) - Low-Resource Language Modelling of South African Languages [6.805575417034369]
We evaluate the performance of open-vocabulary language models on low-resource South African languages.
We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs) and Transformers on small-scale datasets.
Overall, well-regularized RNNs give the best performance across two isiZulu and one Sepedi datasets.
arXiv Detail & Related papers (2021-04-01T21:27:27Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Reusing a Pretrained Language Model on Languages with Limited Corpora
for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq)
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.