Sentiment Classification in Swahili Language Using Multilingual BERT
- URL: http://arxiv.org/abs/2104.09006v1
- Date: Mon, 19 Apr 2021 01:47:00 GMT
- Title: Sentiment Classification in Swahili Language Using Multilingual BERT
- Authors: Gati L. Martin, Medard E. Mswahili, Young-Seob Jeong
- Abstract summary: This study uses the current state-of-the-art model, multilingual BERT, to perform sentiment classification on Swahili datasets.
The data was created by extracting and annotating 8.2k reviews and comments on different social media platforms and the ISEAR emotion dataset.
The model was fine-tuned and achieved a best accuracy of 87.59%.
- Score: 0.04297070083645048
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The evolution of the Internet has increased the amount of information that is
expressed by people on different platforms. This information can be product
reviews, discussions on forums, or social media platforms. Accessibility of
these opinions and people's feelings opens the door to opinion mining and
sentiment analysis. As language and speech technologies become more advanced,
many languages have been used and the best models have been obtained. However,
due to linguistic diversity and lack of datasets, African languages have been
left behind. In this study, by using the current state-of-the-art model,
multilingual BERT, we perform sentiment classification on Swahili datasets. The
data was created by extracting and annotating 8.2k reviews and comments on
different social media platforms and the ISEAR emotion dataset. The data were
classified as either positive or negative. The model was fine-tuned and achieved
a best accuracy of 87.59%.
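The reported figure is an accuracy over binary positive/negative labels. As a minimal sketch of how such a score is computed, the snippet below uses a handful of made-up labels; they are illustrative only and are not drawn from the paper's data. The function names are hypothetical, and nothing here reflects the authors' evaluation code.

```python
from collections import Counter


def binary_accuracy(gold, predicted):
    """Return the fraction of examples whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def confusion_counts(gold, predicted):
    """Count (gold, predicted) label pairs, e.g. ('positive', 'negative')."""
    return Counter(zip(gold, predicted))


# Hypothetical labels for illustration; not the paper's dataset.
gold = ["positive", "negative", "positive", "negative", "positive"]
predicted = ["positive", "negative", "negative", "negative", "positive"]

accuracy = binary_accuracy(gold, predicted)
print(f"accuracy = {accuracy:.2%}")
print(confusion_counts(gold, predicted))
```

Accuracy alone hides which class the errors fall in, which is why the sketch also counts (gold, predicted) pairs.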
Related papers
- From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets [10.264294331399434]
Hate speech datasets have traditionally been developed by language.
We evaluate cultural bias in HS datasets by leveraging two interrelated cultural proxies: language and geography.
We find that HS datasets for English, Arabic and Spanish exhibit a strong geo-cultural bias.
arXiv Detail & Related papers (2024-04-27T12:10:10Z)
- Ensemble Language Models for Multilingual Sentiment Analysis [0.0]
We explore sentiment analysis on tweet texts from SemEval-17 and the Arabic Sentiment Tweet dataset.
Our findings include monolingual models exhibiting superior performance and ensemble models outperforming the baseline.
arXiv Detail & Related papers (2024-03-10T01:39:10Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models [68.29126169579132]
API vendors charge their users based on usage, more specifically on the number of "tokens" processed or generated by the underlying language models.
What constitutes a token, however, is training data and model dependent with a large variance in the number of tokens required to convey the same information in different languages.
We conduct a systematic analysis of the cost and utility of OpenAI's language model API on multilingual benchmarks in 22 typologically diverse languages.
arXiv Detail & Related papers (2023-05-23T05:46:45Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
A main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- Ìtàkúròso: Exploiting Cross-Lingual Transferability for Natural Language Generation of Dialogues in Low-Resource, African Languages [0.9511471519043974]
We investigate the possibility of cross-lingual transfer from a state-of-the-art (SoTA) deep monolingual model to 6 African languages.
The languages are Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda and Yorùbá.
The results show that the hypothesis that deep monolingual models learn some abstractions that generalise across languages holds.
arXiv Detail & Related papers (2022-04-17T20:23:04Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Ceasing hate with MoH: Hate Speech Detection in Hindi-English Code-Switched Language [2.9926023796813728]
This work focuses on analyzing hate speech in Hindi-English code-switched language.
To contain the structure of data, we developed MoH or Map Only Hindi, which means "Love" in Hindi.
The MoH pipeline consists of language identification and Roman-to-Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
arXiv Detail & Related papers (2021-10-18T15:24:32Z)
- Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media [1.8899300124593648]
Prevalence of Hindi-English code-mixed data (Hinglish) is on the rise with most of the urban population all over the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of hate speech in unstructured code-mixed Hinglish text.
arXiv Detail & Related papers (2021-05-11T10:02:28Z)
- XPersona: Evaluating Multilingual Personalized Chatbot [76.00426517401894]
We propose a multi-lingual extension of Persona-Chat, namely XPersona.
Our dataset includes persona conversations in six different languages other than English for building and evaluating multilingual personalized agents.
arXiv Detail & Related papers (2020-03-17T07:52:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.