Cheetah: Natural Language Generation for 517 African Languages
- URL: http://arxiv.org/abs/2401.01053v3
- Date: Wed, 10 Jan 2024 17:43:23 GMT
- Title: Cheetah: Natural Language Generation for 517 African Languages
- Authors: Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed
- Abstract summary: We develop Cheetah, a massively multilingual NLG language model for African languages.
Cheetah supports 517 African languages and language varieties.
The introduction of Cheetah has far-reaching benefits for linguistic diversity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low-resource African languages pose unique challenges for natural language
processing (NLP) tasks, including natural language generation (NLG). In this
paper, we develop Cheetah, a massively multilingual NLG language model for
African languages. Cheetah supports 517 African languages and language
varieties, allowing us to address the scarcity of NLG resources and provide a
solution to foster linguistic diversity. We demonstrate the effectiveness of
Cheetah through comprehensive evaluations across six generation downstream
tasks. In five of the six tasks, Cheetah significantly outperforms other
models, showcasing its remarkable performance for generating coherent and
contextually appropriate text in a wide range of African languages. We
additionally conduct a detailed human evaluation to delve deeper into the
linguistic capabilities of Cheetah. The introduction of Cheetah has
far-reaching benefits for linguistic diversity. By leveraging pretrained models
and adapting them to specific languages, our approach facilitates the
development of practical NLG applications for African communities. The findings
of this study contribute to advancing NLP research in low-resource settings,
enabling greater accessibility and inclusion for African languages in a rapidly
expanding digital landscape. We publicly release our models for research.
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27)
- IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
This paper introduces IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages.
We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across ten open and four proprietary language models.
We observe a significant performance gap between open and proprietary models, with the best-performing open model, Aya-101, reaching only 58% of the performance of the best proprietary model, GPT-4o.
arXiv Detail & Related papers (2024-06-05)
- The Ghanaian NLP Landscape: A First Look
Ghanaian languages, in particular, face an alarming decline, with documented extinctions and several more languages at risk.
This study pioneers a comprehensive survey of Natural Language Processing (NLP) research focused on Ghanaian languages.
arXiv Detail & Related papers (2024-05-10)
- Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions
LINGOLLM is a training-free approach that enables an LLM to process unseen languages that hardly occur in its pre-training data.
We implement LINGOLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks across 8 endangered or low-resource languages.
Our results show that LINGOLLM raises translation quality from GPT-4's 0 BLEU to 10.5 BLEU across 10 language directions.
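For reference, the BLEU scores cited in these summaries can be illustrated with a minimal corpus-level sketch (unsmoothed, up to 4-grams, single reference per hypothesis; published evaluations typically use a standard implementation such as sacreBLEU, so this toy version is for intuition only):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU: clipped n-gram precision plus brevity penalty."""
    match = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n   # hypothesis n-gram counts per order
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = Counter(ngrams(ref, n))
            hyp_counts = Counter(ngrams(hyp, n))
            match[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if 0 in match:
        return 0.0  # any zero precision makes unsmoothed BLEU zero
    precisions = [m / t for m, t in zip(match, total)]
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
print(round(bleu([ref], [ref]), 1))  # an identical hypothesis scores 100.0
```

A score of 0 BLEU, as reported for GPT-4 on some endangered languages, means the output shares essentially no n-grams with the reference translations.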
arXiv Detail & Related papers (2024-02-28)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19)
- ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
Large language models (LLMs) have emerged as among the most important breakthroughs in natural language processing (NLP).
This paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources.
Our extensive experimental results show that ChatGPT performs worse than previous models across a range of NLP tasks and languages.
arXiv Detail & Related papers (2023-04-12)
- SERENGETI: Massively Multilingual Language Models for Africa
We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing against four mPLMs that cover 4-23 African languages.
arXiv Detail & Related papers (2022-12-21T05:54:14Z) - MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity
Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z) - Multilingual Language Model Adaptive Fine-Tuning: A Study on African
Languages [19.067718464786463]
We perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent.
To further specialize the multilingual PLM, we remove vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT.
Our approach is competitive to applying LAFT on individual languages while requiring significantly less disk space.
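The vocabulary-trimming step described above can be sketched roughly as follows. This is a toy illustration, not the authors' code: it uses a made-up five-token vocabulary and approximates script membership by checking Unicode character names, keeping only embedding rows whose tokens are written in the scripts of interest (Latin here, as a stand-in for the scripts retained in the paper):

```python
import unicodedata
import numpy as np

def trim_embeddings(vocab, embeddings, keep_scripts=("LATIN",)):
    """Drop embedding rows for tokens written in unwanted scripts.

    A token survives only if every alphabetic character's Unicode name
    mentions one of the scripts in `keep_scripts`.
    """
    def token_ok(token):
        for ch in token:
            if ch.isalpha():
                name = unicodedata.name(ch, "")
                if not any(script in name for script in keep_scripts):
                    return False
        return True

    kept = [i for i, tok in enumerate(vocab) if token_ok(tok)]
    new_vocab = [vocab[i] for i in kept]
    new_emb = embeddings[kept]  # smaller embedding matrix, same dimension
    return new_vocab, new_emb

# Toy vocabulary mixing Latin-script tokens with CJK and Cyrillic ones.
vocab = ["ndi", "salama", "中文", "привет", "wena"]
emb = np.arange(15, dtype=float).reshape(5, 3)  # 5 tokens, embedding dim 3
new_vocab, new_emb = trim_embeddings(vocab, emb)
print(new_vocab)      # only the Latin-script tokens survive
print(new_emb.shape)  # (3, 3)
```

Because the embedding matrix dominates the parameter count of many multilingual PLMs, dropping unused-script rows is what yields the disk-space savings the summary mentions.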
arXiv Detail & Related papers (2022-04-13)
- Low-Resource Language Modelling of South African Languages
We evaluate the performance of open-vocabulary language models on low-resource South African languages.
We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs) and Transformers on small-scale datasets.
Overall, well-regularized RNNs give the best performance across two isiZulu datasets and one Sepedi dataset.
arXiv Detail & Related papers (2021-04-01)
- SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
arXiv Detail & Related papers (2020-06-20)