HausaNLP at SemEval-2023 Task 12: Leveraging African Low Resource
TweetData for Sentiment Analysis
- URL: http://arxiv.org/abs/2304.13634v1
- Date: Wed, 26 Apr 2023 15:47:50 GMT
- Title: HausaNLP at SemEval-2023 Task 12: Leveraging African Low Resource
TweetData for Sentiment Analysis
- Authors: Saheed Abdullahi Salahudeen, Falalu Ibrahim Lawan, Ahmad Mustapha
Wali, Amina Abubakar Imam, Aliyu Rabiu Shuaibu, Aliyu Yusuf, Nur Bala Rabiu,
Musa Bello, Shamsuddeen Umaru Adamu, Saminu Mohammad Aliyu, Murja Sani
Gadanya, Sanah Abdullahi Muaz, Mahmoud Said Ahmad, Abdulkadir Abdullahi,
Abdulmalik Yusuf Jamoh
- Abstract summary: We present the findings of SemEval-2023 Task 12, a shared task on sentiment analysis for low-resource African languages using a Twitter dataset.
Our goal is to leverage low-resource tweet data using the pre-trained Afro-xlmr-large, AfriBERTa-Large, Bert-base-arabic-camelbert-da-sentiment (Arabic-camelbert), Multilingual-BERT (mBERT), and BERT models for sentiment analysis of 14 African languages.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present the findings of SemEval-2023 Task 12, a shared task on sentiment
analysis for low-resource African languages using a Twitter dataset. The task
featured three subtasks: subtask A, monolingual sentiment classification across
12 tracks (one per language); subtask B, multilingual sentiment classification
using the tracks in subtask A; and subtask C, zero-shot sentiment
classification. We present the results and findings of subtasks A, B, and C,
and we release our code on GitHub. Our goal is to leverage low-resource tweet
data using the pre-trained Afro-xlmr-large, AfriBERTa-Large,
Bert-base-arabic-camelbert-da-sentiment (Arabic-camelbert), Multilingual-BERT
(mBERT), and BERT models for sentiment analysis of 14 African languages. The
datasets for these subtasks consist of gold-standard, multi-class-labeled
Twitter data in these languages. Our results demonstrate that the
Afro-xlmr-large model outperformed the other models on most of the language
datasets. Similarly, the Nigerian languages Hausa, Igbo, and Yoruba achieved
better performance than the other languages, which can be attributed to the
larger volume of data available for them.
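
For context, the following is a minimal sketch of how one of the listed checkpoints could be fine-tuned for tweet sentiment classification with the HuggingFace Transformers library. It assumes the public Hub checkpoint Davlan/afro-xlmr-large and the three-class AfriSenti label set; the toy Hausa batch and the hyperparameters are illustrative, not the authors' exact configuration.

```python
# Minimal fine-tuning sketch for 3-way tweet sentiment classification.
# Assumes the public HF Hub checkpoint "Davlan/afro-xlmr-large"; batch and
# learning rate are illustrative, not the authors' exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "Davlan/afro-xlmr-large"
LABELS = ["negative", "neutral", "positive"]  # AfriSenti-style label set

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=len(LABELS))  # new classification head on top

# One illustrative training step on a toy Hausa tweet.
batch = tokenizer(["Ina son wannan waka sosai!"],  # "I really love this song!"
                  return_tensors="pt", truncation=True, padding=True)
labels = torch.tensor([2])  # index of "positive"

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
optimizer.zero_grad()
outputs = model(**batch, labels=labels)  # loss is cross-entropy over labels
outputs.loss.backward()
optimizer.step()
```

In practice one would loop this step over the per-language training splits and select the checkpoint on the development set, but the core mechanism of the systems described above is exactly this: a pretrained multilingual encoder plus a small classification head, fine-tuned on labeled tweets.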
Related papers
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages (arXiv, 2023-09-19)
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants (arXiv, 2023-08-31)
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
- Izindaba-Tindzaba: Machine learning news categorisation for Long and Short Text for isiZulu and Siswati (arXiv, 2023-06-12)
Local/Native South African languages are classified as low-resource languages.
In this work, the focus was to create annotated news datasets for the isiZulu and Siswati native languages.
- Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages (arXiv, 2023-05-25)
We propose Cross-Lingual Knowledge Distillation (CLKD) from a strong English AS2 teacher as a method to train AS2 models for low-resource languages.
To evaluate our method, we introduce 1) Xtr-WikiQA, a translation-based WikiQA dataset for 9 additional languages, and 2) TyDi-AS2, a multilingual AS2 dataset with over 70K questions spanning 8 typologically diverse languages.
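
The distillation idea in this entry can be illustrated with the standard soft-target objective: the student is trained to match the teacher's softened output distribution in addition to the gold labels. This is a generic sketch of knowledge distillation, not necessarily CLKD's exact formulation.

```python
# Generic knowledge-distillation loss: KL divergence between the student's
# and teacher's temperature-softened distributions, mixed with the usual
# hard-label cross-entropy. A sketch of the idea, not CLKD's exact objective.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2  # rescale gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard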
- DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning (arXiv, 2023-05-04)
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, where we employed different multilingual XLM-R models with a classification head trained on various data.
- UIO at SemEval-2023 Task 12: Multilingual fine-tuning for sentiment classification in low-resource languages (arXiv, 2023-04-27)
We show how a multilingual large language model can be a resource for sentiment analysis in languages not seen during pretraining.
The languages are to various degrees related to languages used during pretraining, and the language data contain various degrees of code-switching.
We experiment with both monolingual and multilingual datasets for the final fine-tuning, and find that with the provided datasets that contain samples in the thousands, monolingual fine-tuning yields the best results.
- Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages (arXiv, 2023-04-13)
The task aims to perform monolingual sentiment classification (sub-task A) for 12 African languages, multilingual sentiment classification (sub-task B), and zero-shot sentiment classification (sub-task C).
Our findings suggest that using pre-trained Afro-centric language models improves performance for low-resource African languages.
We also ran experiments using adapters for the zero-shot task, and the results suggest that adapters can achieve promising performance with a limited amount of resources.
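
The adapter approach mentioned here can be sketched with a generic Houlsby-style bottleneck module: a small down-projection/up-projection pair with a residual connection, inserted into a frozen transformer layer so that only the adapter's few parameters are trained. This is a generic illustration, not that paper's exact architecture or library.

```python
# Generic bottleneck adapter: a small down/up projection with a residual
# connection, inserted into a frozen transformer layer. Only the adapter
# parameters are trained, which makes adaptation cheap for low-resource
# languages. A sketch of the technique, not the paper's exact setup.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual keeps the frozen model's representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```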
- AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages (arXiv, 2023-02-17)
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages (arXiv, 2021-09-10)
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders (arXiv, 2020-10-15)
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
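
One way to picture an explicit sentence-level alignment objective of this kind is an InfoNCE-style contrastive loss that pulls embeddings of parallel source/target sentences together. This is a generic illustration of alignment training on parallel data, not AMBER's exact objectives.

```python
# Generic sentence-level alignment objective: embeddings of the i-th source
# and i-th target sentence (a parallel pair) should be more similar than all
# mismatched pairs. Illustrates explicit alignment; not AMBER's exact losses.
import torch
import torch.nn.functional as F

def sentence_alignment_loss(src_emb, tgt_emb, temperature=0.05):
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature   # pairwise cosine similarities
    targets = torch.arange(src.size(0))  # i-th src aligns with i-th tgt
    return F.cross_entropy(logits, targets)
```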