NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual
Sentiment Analysis
- URL: http://arxiv.org/abs/2201.08277v1
- Date: Thu, 20 Jan 2022 16:28:06 GMT
- Title: NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual
Sentiment Analysis
- Authors: Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Ibrahim Said
Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye
Emezue, Anuoluwapo Aremu, Saheed Abdul, Pavel Brazdil
- Abstract summary: We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria.
The dataset consists of around 30,000 annotated tweets per language.
We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
- Score: 5.048355865260207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sentiment analysis is one of the most widely studied applications in NLP, but
most work focuses on languages with large amounts of data. We introduce the
first large-scale human-annotated Twitter sentiment dataset for the four most
widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yoruba)
consisting of around 30,000 annotated tweets per language (except for
Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We
propose text collection, filtering, processing, and labelling methods that
enable us to create datasets for these low-resource languages. We evaluate a
range of pre-trained models and transfer strategies on the dataset. We find
that language-specific models and language-adaptive fine-tuning generally
perform best. We release the datasets, trained models, sentiment lexicons, and
code to incentivize research on sentiment analysis in under-represented
languages.
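The released sentiment lexicons map words to polarity labels, which supports a simple lexicon-based baseline alongside the pre-trained models. A minimal sketch of such a baseline is shown below; the lexicon entries and the `score_tweet` helper are illustrative placeholders, not taken from the actual NaijaSenti release:

```python
# Minimal lexicon-based sentiment baseline (a sketch; the example
# lexicon entries below are placeholders, not the NaijaSenti lexicons).

def score_tweet(tokens, lexicon):
    """Sum per-token polarities; the sign of the sum gives the label."""
    total = sum(lexicon.get(tok.lower(), 0) for tok in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

# Hypothetical lexicon: word -> polarity (+1 positive, -1 negative).
lexicon = {"kyau": 1, "daidai": 1, "mugu": -1}

print(score_tweet("wannan abu yana da kyau".split(), lexicon))
```

A lexicon baseline like this is cheap and interpretable, which makes it a useful sanity check against the fine-tuned models, although it cannot handle negation or code-mixing without further rules.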
Related papers
- Fine-tuning multilingual language models in Twitter/X sentiment analysis: a study on Eastern-European V4 languages [0.0]
We focus on aspect-based sentiment analysis (ABSA) subtasks based on Twitter/X data in underrepresented languages.
We fine-tune several LLMs for classification of sentiment towards Russia and Ukraine.
We document several interesting phenomena, demonstrating, among other things, that some models are much better suited to fine-tuning on multilingual Twitter tasks than others.
arXiv Detail & Related papers (2024-08-04T14:35:30Z)
- A multilingual dataset for offensive language and hate speech detection for hausa, yoruba and igbo languages [0.0]
This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo.
We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers.
We evaluated pre-trained language models to assess their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%.
arXiv Detail & Related papers (2024-06-04T09:58:29Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
We aim to create a text classification dataset encompassing a large number of languages.
We leverage parallel translations of the Bible to construct such a dataset.
By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
arXiv Detail & Related papers (2023-05-15T09:43:32Z)
- Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language [32.170748231414365]
We show that training on even just a single related language gives the largest gain.
We also find that adding data from unrelated languages generally doesn't hurt performance.
arXiv Detail & Related papers (2021-06-24T08:37:05Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have attracted significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.