Related papers: Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models

URL: http://arxiv.org/abs/2510.07203v1
Date: Wed, 08 Oct 2025 16:35:53 GMT
Title: Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models
Authors: Benjamin Akera, Evelyn Nafula Ouma, Gilbert Yiga, Patrick Walukagga, Phionah Natukunda, Trevor Saaka, Solomon Nsumba, Lilian Teddy Nabukeera, Joel Muhanguzi, Imran Sekalala, Nimpamya Janat Namara, Engineer Bainomugisha, Ernest Mwebaze, John Quinn
Abstract summary: There are more than 2000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs exhibit strong performance on a number of the most common languages, but prioritise support for the languages with the most speakers first. We contend that a regionally focussed approach is more efficient, and present a case study for Uganda.
Score: 1.6095610725007592
License: http://creativecommons.org/licenses/by/4.0/
Abstract: There are more than 2000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs exhibit strong performance on a number of the most common languages (e.g. Swahili or Yoruba), but prioritise support for the languages with the most speakers first, resulting in piecemeal ability across disparate languages. We contend that a regionally focussed approach is more efficient, and present a case study for Uganda, a country with high linguistic diversity. We describe the development of Sunflower 14B and 32B, a pair of models based on Qwen 3 with state of the art comprehension in the majority of all Ugandan languages. These models are open source and can be used to reduce language barriers in a number of important practical applications.

Related papers

Multilingual Large Language Models do not comprehend all natural languages to equal degrees [3.1312895682585595]
Large Language Models (LLMs) play a critical role in how humans access information. Most benchmarks evaluate LLMs in languages spoken by Western, Educated, Industrialised, Rich, and Democratic (WEIRD) communities. We prompt 3 popular models on a language comprehension task across 12 languages. Our results suggest that the models exhibit remarkable linguistic accuracy across typologically diverse languages.
arXiv Detail & Related papers (2026-02-23T17:22:46Z)
FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models [1.2403152094314245]
We introduce FORMOSANBENCH, the first benchmark for evaluating large language models (LLMs) on low-resource Austronesian languages. We assess model performance in zero-shot, 10-shot, and fine-tuned settings using FORMOSANBENCH. Our results reveal a substantial performance gap between high-resource and Formosan languages.
arXiv Detail & Related papers (2025-06-12T07:02:28Z)
Lugha-Llama: Adapting Large Language Models for African Languages [48.97516583523523]
Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. We consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
arXiv Detail & Related papers (2025-04-09T02:25:53Z)
Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
We propose Lens, a novel approach to enhance multilingual capabilities in large language models (LLMs). Lens operates on two subspaces: the language-agnostic subspace, where it aligns target languages with the central language to inherit strong semantic representations, and the language-specific subspace, where it separates target and central languages to preserve linguistic specificity. Lens significantly improves multilingual performance while maintaining the model's English proficiency, achieving better results with less computational cost compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. We create the largest human-annotated NER dataset for 20 African languages. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
Ìtàkúròso: Exploiting Cross-Lingual Transferability for Natural Language Generation of Dialogues in Low-Resource, African Languages [0.9511471519043974]
We investigate the possibility of cross-lingual transfer from a state-of-the-art (SoTA) deep monolingual model to 6 African languages. The languages are Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá. The results show that the hypothesis that deep monolingual models learn some abstractions that generalise across languages holds.
arXiv Detail & Related papers (2022-04-17T20:23:04Z)
Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages [19.067718464786463]
We perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT. Our approach is competitive with applying LAFT on individual languages while requiring significantly less disk space.
arXiv Detail & Related papers (2022-04-13T16:13:49Z)
Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology. For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text respectively.
arXiv Detail & Related papers (2022-03-17T16:48:22Z)
The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and well-documented language among the sub-Saharan African languages. It is estimated that over 100 million people speak the language. We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi [2.76240219662896]
We study the ability of multilingual language models to process an unseen dialect. We take user generated North-African Arabic as our case study. We show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect.
arXiv Detail & Related papers (2020-05-01T11:29:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.