AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
- URL: http://arxiv.org/abs/2302.08956v5
- Date: Sat, 4 Nov 2023 19:48:38 GMT
- Title: AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
- Authors: Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele,
Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa'id
Ahmad, Meriem Beloucif, Saif M. Mohammad, Sebastian Ruder, Oumaima Hourrane,
Pavel Brazdil, Felermino Dário Mário António Ali, Davis David, Salomey
Osei, Bello Shehu Bello, Falalu Ibrahim, Tajuddeen Gwadabe, Samuel Rutunda,
Tadesse Belay, Wendimu Baye Messelle, Hailu Beshada Balcha, Sisay Adugna
Chala, Hagos Tesfahun Gebremichael, Bernard Opoku, Steven Arthur
- Abstract summary: Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
- Score: 45.88640066767242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Africa is home to over 2,000 languages from more than six language families
and has the highest linguistic diversity among all continents. These include 75
languages with at least one million speakers each. Yet, there is little NLP
research conducted on African languages. Crucial to enabling such research is
the availability of high-quality annotated datasets. In this paper, we
introduce AfriSenti, a sentiment analysis benchmark that contains a total of
>110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo,
Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo,
Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families.
The tweets were annotated by native speakers and used in the AfriSenti-SemEval
shared task, which had over 200 participants (see the website at
https://afrisenti-semeval.github.io). We describe the data collection
methodology, annotation process, and the challenges we dealt with when curating
each dataset. We further report baseline experiments conducted on the different
datasets and discuss their usefulness.
Related papers
- WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines [74.25764182510295]
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English.
We introduce World Cuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding.
This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points.
arXiv Detail & Related papers (2024-10-16T16:11:49Z)
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- HausaNLP at SemEval-2023 Task 12: Leveraging African Low Resource TweetData for Sentiment Analysis [0.0]
We present the findings of SemEval-2023 Task 12, a shared task on sentiment analysis for low-resource African languages using a Twitter dataset.
Our goal is to leverage low-resource tweet data using pre-trained Afro-xlmr-large, AfriBERTa-Large, Bert-base-arabic-camelbert-da-sentiment (Arabic-camelbert), Multilingual-BERT (mBERT) and BERT models for sentiment analysis of 14 African languages.
arXiv Detail & Related papers (2023-04-26T15:47:50Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- Ìtàkúròso: Exploiting Cross-Lingual Transferability for Natural Language Generation of Dialogues in Low-Resource, African Languages [0.9511471519043974]
We investigate the possibility of cross-lingual transfer from a state-of-the-art (SoTA) deep monolingual model to 6 African languages.
The languages are Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda, and Yorùbá.
The results show that the hypothesis that deep monolingual models learn some abstractions that generalise across languages holds.
arXiv Detail & Related papers (2022-04-17T20:23:04Z)
- Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We present the first comprehensive public datasets, named HUST-ART, HUST-AST, ABE, and Tana, for Amharic script detection and recognition in natural scenes.
arXiv Detail & Related papers (2022-03-23T03:19:35Z)
- NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis [5.048355865260207]
We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria.
The dataset consists of around 30,000 annotated tweets per language.
We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
arXiv Detail & Related papers (2022-01-20T16:28:06Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and well-documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
- Lanfrica: A Participatory Approach to Documenting Machine Translation Research on African Languages [0.012691047660244334]
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages.
This makes it hard to keep track of the MT research, models, and datasets that have been developed for some of them.
Online platforms can help make research, benchmarks, and datasets in these African languages accessible.
arXiv Detail & Related papers (2020-08-03T18:14:04Z)