Lugha-Llama: Adapting Large Language Models for African Languages
- URL: http://arxiv.org/abs/2504.06536v1
- Date: Wed, 09 Apr 2025 02:25:53 GMT
- Title: Lugha-Llama: Adapting Large Language Models for African Languages
- Authors: Happy Buzaaba, Alexander Wettig, David Ifeoluwa Adelani, Christiane Fellbaum
- Abstract summary: Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. We consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. However, they often struggle to recognize low-resource languages, in particular African languages, which are not well represented in large training corpora. In this paper, we consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages. On the challenging IrokoBench dataset, our models consistently achieve the best performance amongst similarly sized baselines, particularly on knowledge-intensive multiple-choice questions (AfriMMLU). Additionally, on the cross-lingual question answering benchmark AfriQA, our models outperform the base model by over 10%. To better understand the role of English data during training, we translate a subset of 200M tokens into Swahili and perform an analysis which reveals that the content of these data is primarily responsible for the strong performance. We release our models and data to encourage future research on African languages.
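As a rough illustration of the two-source training mix described in the abstract, the sketch below interleaves an African-language corpus with English educational text at a tunable ratio. The corpus contents and the 50/50 default are placeholders, not the paper's actual data recipe:

```python
import random

def mix_corpora(african_docs, english_edu_docs, english_frac=0.5, seed=0):
    """Yield documents from two streams so that roughly `english_frac`
    of the output comes from the English educational corpus."""
    rng = random.Random(seed)
    african, english = iter(african_docs), iter(english_edu_docs)
    while True:
        source = english if rng.random() < english_frac else african
        try:
            yield next(source)
        except StopIteration:
            return  # stop once either stream runs dry

# Toy stand-ins for the two corpora.
swahili_docs = [f"swahili doc {i}" for i in range(100)]
edu_docs = [f"english edu doc {i}" for i in range(100)]
print(list(mix_corpora(swahili_docs, edu_docs))[:5])
```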
Related papers
- IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
This paper introduces IrokoBench -- a human-translated benchmark dataset for 17 typologically-diverse low-resource African languages. We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and six proprietary language models (a minimal translate-test loop is sketched after this entry). We observe a significant performance gap between open and proprietary models, with the highest-performing open model, Gemma 2 27B, reaching only 63% of the performance of the best proprietary model, GPT-4o.
arXiv Detail & Related papers (2024-06-05T15:23:08Z)
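As a hedged illustration of the translate-test setting named above, the toy loop below translates each question into English before querying the model; `translate_fn` and `answer_fn` are hypothetical stand-ins, not IrokoBench's actual harness:

```python
def translate_test_accuracy(examples, answer_fn, translate_fn):
    """Translate each non-English question into English, then let the
    model answer in English (the 'translate-test' setting)."""
    correct = 0
    for ex in examples:
        question_en = translate_fn(ex["question"])   # e.g. an MT system
        prediction = answer_fn(question_en, ex["choices"])
        correct += prediction == ex["answer"]
    return correct / len(examples)

# Toy stand-ins for the MT system and the model under evaluation.
examples = [{"question": "Mji mkuu wa Kenya ni upi?",
             "choices": ["Nairobi", "Lagos"], "answer": "Nairobi"}]
acc = translate_test_accuracy(
    examples,
    answer_fn=lambda q, choices: choices[0],         # dummy model
    translate_fn=lambda q: "What is the capital of Kenya?",
)
print(f"translate-test accuracy: {acc:.2f}")
```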
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages (a minimal pointwise reranking sketch follows this entry).
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
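One common way to realize LLM-based cross-lingual reranking is pointwise scoring of each passage against the English query. The sketch below assumes a hypothetical `score_fn`; a real system would prompt an LLM with the query-passage pair and parse its relevance judgment:

```python
def rerank(query_en, passages, score_fn, top_k=3):
    """Order African-language passages by a relevance score for an
    English query (one common pointwise reranking pattern)."""
    scored = [(score_fn(query_en, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_k]]

# Dummy scorer standing in for an LLM relevance call.
def dummy_score(query, passage):
    return sum(w in passage.lower() for w in query.lower().split())

passages_sw = ["Nairobi ni mji mkuu wa Kenya.", "Simba ni mnyama."]
print(rerank("capital city of Kenya Nairobi", passages_sw, dummy_score))
```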
- AfroBench: How Good are Large Language Models on African Languages?
AfroBench is a benchmark for evaluating the performance of LLMs across 64 African languages. AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task.
arXiv Detail & Related papers (2023-11-14T08:10:14Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- PolyLM: An Open Source Polyglot Large Language Model
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training (a toy version of this schedule follows this entry).
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
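The 30%-to-60% curriculum in the PolyLM summary can be pictured as a sampling schedule. The endpoints come from the abstract; the linear interpolation between them is an illustrative assumption:

```python
def non_english_fraction(step, total_steps, start=0.30, end=0.60):
    """Anneal the probability of sampling non-English data.
    The 30% -> 60% endpoints come from the abstract; the linear
    shape between them is an illustrative assumption."""
    progress = min(step / total_steps, 1.0)
    return start + (end - start) * progress

for step in (0, 50_000, 100_000):
    print(step, round(non_english_fraction(step, total_steps=100_000), 2))
```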
- How Good are Commercial Large Language Models on African Languages?
We present a preliminary analysis of commercial large language models on two tasks (machine translation and text classification) across eight African languages.
Our results suggest that commercial language models produce below-par performance on African languages.
In general, our findings present a call-to-action to ensure African languages are well represented in commercial large language models.
arXiv Detail & Related papers (2023-05-11T02:29:53Z)
- AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
We present AfroLM, a multilingual language model pretrained from scratch on 23 African languages using a self-active learning framework (a toy round of this loop follows this entry).
AfroLM is pretrained on a dataset 14x smaller than existing baselines.
It is able to generalize well across various domains.
arXiv Detail & Related papers (2022-11-07T02:15:25Z)
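A highly simplified picture of one self-active learning round, under the assumption that the most uncertain unlabeled examples are promoted into the training set each round; `ToyModel` and its methods are invented stand-ins, not AfroLM's actual implementation:

```python
import random

class ToyModel:
    """Stand-in for a masked LM; uncertainty here is random noise."""
    def train_on(self, examples): pass
    def uncertainty(self, example): return random.random()

def self_active_learning(model, labeled, unlabeled, rounds=3, budget=2):
    """Each round: train, rank the unlabeled pool by model uncertainty,
    and move the most uncertain examples into the training set."""
    for _ in range(rounds):
        model.train_on(labeled)
        ranked = sorted(unlabeled, key=model.uncertainty, reverse=True)
        labeled, unlabeled = labeled + ranked[:budget], ranked[budget:]
    return labeled

pool = [f"sentence {i}" for i in range(10)]
print(len(self_active_learning(ToyModel(), ["seed"], pool)))
```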
- Low-Resource Language Modelling of South African Languages
We evaluate the performance of open-vocabulary language models on low-resource South African languages.
We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs) and Transformers on small-scale datasets.
Overall, well-regularized RNNs give the best performance across two isiZulu datasets and one Sepedi dataset (a toy perplexity computation is sketched below).
arXiv Detail & Related papers (2021-04-01T21:27:27Z)
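For intuition about how such language model variants are compared, the sketch below computes perplexity for an add-one-smoothed bigram model; it is a generic illustration, not the paper's experimental setup, and the isiZulu tokens are arbitrary examples:

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    """Add-one smoothed bigram model; lower perplexity = better fit."""
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    vocab = len(set(train_tokens)) or 1
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_prob = 0.0
    for w1, w2 in pairs:
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(pairs), 1))

train = "umfana omncane udlala ibhola".split() * 20
test = "umfana udlala ibhola".split()
print(round(bigram_perplexity(train, test), 2))
```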