Sentiment Analysis Across Multiple African Languages: A Current
Benchmark
- URL: http://arxiv.org/abs/2310.14120v1
- Date: Sat, 21 Oct 2023 21:38:06 GMT
- Title: Sentiment Analysis Across Multiple African Languages: A Current
Benchmark
- Authors: Saurav K. Aryal, Howard Prioleau, Surakshya Aryal
- Abstract summary: An annotated sentiment analysis dataset covering 14 African languages was made available.
We benchmarked and compared current state-of-the-art transformer models across 12 languages.
Our results show that, despite work in low-resource modeling, more data still produces better models on a per-language basis.
- Score: 5.701291200264771
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sentiment analysis is a fundamental and valuable task in NLP. However, due to
limitations in data and technological availability, research into sentiment
analysis of African languages has been fragmented and lacking. With the recent
release of the AfriSenti-SemEval Shared Task 12, hosted as a part of The 17th
International Workshop on Semantic Evaluation, an annotated sentiment analysis
dataset covering 14 African languages was made available. We benchmarked current
state-of-the-art transformer models across 12 languages and compared the
performance of training one model per language versus a single model for all
languages. We also evaluated the performance of standard multilingual models
and their ability to learn and transfer cross-lingual representations from
non-African to African languages. Our results show that, despite work in
low-resource modeling, more data still produces better models on a per-language
basis. Models explicitly developed for African languages outperform other
models on all tasks. Additionally, no one-model-fits-all solution emerges from
the per-language evaluation of the models considered.
Moreover, for some languages with a smaller sample size, a larger multilingual
model may perform better than a dedicated per-language model for sentiment
classification.
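For illustration, the sketch below fine-tunes a multilingual transformer under the two regimes compared in the abstract: one model per language and a single model on all languages pooled together. It is a minimal sketch assuming the Hugging Face transformers and datasets libraries; the checkpoint name, language subset, file paths, and hyperparameters are illustrative assumptions rather than the authors' exact setup.

```python
from datasets import concatenate_datasets, load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "Davlan/afro-xlmr-base"   # assumed African-language-centric XLM-R variant
LANGS = ["hau", "yor", "swa"]          # illustrative subset of the benchmark languages


def fine_tune(train_ds, dev_ds, out_dir):
    """Fine-tune a 3-way (negative/neutral/positive) sentiment classifier."""
    tok = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=3)

    def encode(batch):
        return tok(batch["text"], truncation=True, max_length=128)

    train_ds = train_ds.map(encode, batched=True)
    dev_ds = dev_ds.map(encode, batched=True)
    args = TrainingArguments(output_dir=out_dir, num_train_epochs=3,
                             per_device_train_batch_size=16, save_strategy="no")
    trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                      eval_dataset=dev_ds, tokenizer=tok)
    trainer.train()
    return trainer.evaluate()   # held-out metrics (loss by default) on the dev split


# Assumes per-language JSON files with "text" and integer "label" fields (hypothetical paths).
splits = {lang: load_dataset("json", data_files={"train": f"{lang}_train.json",
                                                 "dev": f"{lang}_dev.json"})
          for lang in LANGS}

# Regime 1: one fine-tuned model per language.
per_language_scores = {lang: fine_tune(ds["train"], ds["dev"], out_dir=f"models/{lang}")
                       for lang, ds in splits.items()}

# Regime 2: a single model trained on all languages pooled together.
pooled_score = fine_tune(concatenate_datasets([ds["train"] for ds in splits.values()]),
                         concatenate_datasets([ds["dev"] for ds in splits.values()]),
                         out_dir="models/all-languages")
```

In practice the pooled model would also be evaluated on each language's dev split separately, so that the two regimes can be compared per language as described above.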
Related papers
- InkubaLM: A small language model for low-resource African languages [9.426968756845389]
InkubaLM is a small language model with 0.4 billion parameters.
It achieves performance comparable to models with significantly larger parameter counts.
It demonstrates remarkable consistency across multiple languages.
arXiv Detail & Related papers (2024-08-30T05:42:31Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning [0.0]
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, in which we employed different multilingual XLM-R models with a classification head, trained on various data.
arXiv Detail & Related papers (2023-05-04T07:28:45Z)
- UBC-DLNLP at SemEval-2023 Task 12: Impact of Transfer Learning on African Sentiment Analysis [5.945320097465418]
We tackle the task of sentiment analysis in 14 different African languages.
We develop both monolingual and multilingual models under a fully supervised setting.
Our results demonstrate the effectiveness of transfer learning and fine-tuning techniques for sentiment analysis.
arXiv Detail & Related papers (2023-04-21T21:25:14Z)
- AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages [0.021987601456703476]
We present AfroLM, a multilingual language model pretrained from scratch on 23 African languages.
AfroLM is pretrained on a dataset 14x smaller than existing baselines.
It is able to generalize well across various domains.
arXiv Detail & Related papers (2022-11-07T02:15:25Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- Low-Resource Language Modelling of South African Languages [6.805575417034369]
We evaluate the performance of open-vocabulary language models on low-resource South African languages.
We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs) and Transformers on small-scale datasets.
Overall, well-regularized RNNs give the best performance across two isiZulu datasets and one Sepedi dataset.
arXiv Detail & Related papers (2021-04-01T21:27:27Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)