Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs
- URL: http://arxiv.org/abs/2404.19442v4
- Date: Mon, 10 Feb 2025 15:08:20 GMT
- Title: Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs
- Authors: David Ifeoluwa Adelani, A. Seza Doğruöz, Iyanuoluwa Shode, Anuoluwapo Aremu,
- Abstract summary: Naija is a Nigerian Pidgin spoken by approximately 120M speakers.
West African Pidgin English (WAPE) is also spoken in Nigeria.
- Score: 8.829688681748413
- License:
- Abstract: Nigeria is a multilingual country with 500+ languages. Naija is a Nigerian Pidgin spoken by approximately 120M speakers and it is a mixed language (e.g., English, Portuguese, Yoruba, Hausa and Igbo). Although it has mainly been a spoken language until recently, there are some online platforms (e.g., Wikipedia), publishing in written Naija as well. West African Pidgin English (WAPE) is also spoken in Nigeria and it is used by BBC to broadcast news on the internet to a wider audience not only in Nigeria but also in other West African countries (e.g., Cameroon and Ghana). Through statistical analyses and Machine Translation experiments, our paper shows that these two pidgin varieties do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and Generative AI operates only based on WAPE. In other words, Naija is underrepresented in Generative AI, and it is hard to teach LLMs with few examples. In addition to the statistical analyses, we also provide historical information on both pidgins as well as insights from the interviews conducted with volunteer Wikipedia contributors in Naija.
Related papers
- Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages [51.96666324242191]
We analyze whether user utilization of novel writing assistants in a charity advertisement writing task is affected by the AI's performance in a second language.
We quantify the extent to which these patterns translate into the persuasiveness of generated charity advertisements.
arXiv Detail & Related papers (2025-02-13T17:49:30Z) - Nollywood: Let's Go to the Movies! [3.818480245025447]
We create a phonetic sub-title model that is able to translate Nigerian English speech to American English.
We also use the most advanced toxicity detectors to discover how toxic the speech is.
Our aim is to highlight the text in these videos which is often times ignored for lack of dialectal understanding.
arXiv Detail & Related papers (2024-07-02T19:50:55Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorub'a is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z) - \`It\`ak\'ur\`oso: Exploiting Cross-Lingual Transferability for Natural
Language Generation of Dialogues in Low-Resource, African Languages [0.9511471519043974]
We investigate the possibility of cross-lingual transfer from a state-of-the-art (SoTA) deep monolingual model to 6 African languages.
The languages are Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yorub'a.
The results show that the hypothesis that deep monolingual models learn some abstractions that generalise across languages holds.
arXiv Detail & Related papers (2022-04-17T20:23:04Z) - One Country, 700+ Languages: NLP Challenges for Underrepresented
Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages.
We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z) - Towards End-to-End Training of Automatic Speech Recognition for Nigerian
Pidgin [0.0]
Nigerian pidgin is one of the most popular languages in West Africa.
We present the first parallel (speech-to-text) data on Nigerian pidgin.
We also trained the first end-to-end speech recognition system on this language.
arXiv Detail & Related papers (2020-10-21T16:32:58Z) - SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological
Inflection [81.85463892070085]
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
arXiv Detail & Related papers (2020-06-20T13:24:14Z) - Towards Supervised and Unsupervised Neural Machine Translation Baselines
for Nigerian Pidgin [0.2792030485253753]
Nigerian Pidgin is arguably the most widely spoken language in Nigeria. Variants of this language are also spoken across West and Central Africa.
This work aims to establish supervised and unsupervised neural machine translation baselines between English and Nigerian Pidgin.
arXiv Detail & Related papers (2020-03-27T22:40:01Z) - Towards Neural Machine Translation for Edoid Languages [2.144787054581292]
Many Nigerian languages have relinquished their previous prestige and purpose in modern society to English and Nigerian Pidgin.
This work explores the feasibility of Neural Machine Translation for the Edoid language family of Southern Nigeria.
arXiv Detail & Related papers (2020-03-24T07:53:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.