Hate Speech Detection and Classification in Amharic Text with Deep Learning
- URL: http://arxiv.org/abs/2408.03849v1
- Date: Wed, 7 Aug 2024 15:46:45 GMT
- Title: Hate Speech Detection and Classification in Amharic Text with Deep Learning
- Authors: Samuel Minale Gashe, Seid Muhie Yimam, Yaregal Assabie,
- Abstract summary: We develop Amharic hate speech data and SBi-LSTM deep learning model that can detect and classify text into four categories of hate speech.
We have annotated 5k Amharic social media post and comment data into four categories.
The model achieves a 94.8 F1-score performance.
- Score: 4.834669033093363
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Hate speech is a growing problem on social media. It can seriously impact society, especially in countries like Ethiopia, where it can trigger conflicts among diverse ethnic and religious groups. While hate speech detection in resource rich languages are progressing, for low resource languages such as Amharic are lacking. To address this gap, we develop Amharic hate speech data and SBi-LSTM deep learning model that can detect and classify text into four categories of hate speech: racial, religious, gender, and non-hate speech. We have annotated 5k Amharic social media post and comment data into four categories. The data is annotated using a custom annotation tool by a total of 100 native Amharic speakers. The model achieves a 94.8 F1-score performance. Future improvements will include expanding the dataset and develop state-of-the art models. Keywords: Amharic hate speech detection, classification, Amharic dataset, Deep Learning, SBi-LSTM
Related papers
- Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts [0.0]
We first introduce the ViTHSD - a targeted hate speech detection dataset for Vietnamese Social Media Texts.
The dataset contains 10K comments, each comment is labeled to specific targets with three levels: clean, offensive, and hate.
The inter-annotator agreement obtained from the dataset is 0.45 by Cohen's Kappa index, which is indicated as a moderate level.
arXiv Detail & Related papers (2024-04-30T04:16:55Z) - Analysis and Detection of Multilingual Hate Speech Using Transformer
Based Deep Learning [7.332311991395427]
As the prevalence of hate speech increases online, the demand for automated detection as an NLP task is increasing.
In this work, the proposed method is using transformer-based model to detect hate speech in social media, like twitter, Facebook, WhatsApp, Instagram, etc.
The Gold standard datasets were collected from renowned researcher Zeerak Talat, Sara Tonelli, Melanie Siegel, and Rezaul Karim.
The success rate of the proposed model for hate speech detection is higher than the existing baseline and state-of-the-art models with accuracy in Bengali dataset is 89%, in English: 91%, in German
arXiv Detail & Related papers (2024-01-19T20:40:23Z) - Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
Main ingredients are a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z) - CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a
Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z) - AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Detection of Hate Speech using BERT and Hate Speech Word Embedding with
Deep Model [0.5801044612920815]
This paper investigates the feasibility of leveraging domain-specific word embedding in Bidirectional LSTM based deep model to automatically detect/classify hate speech.
The experiments showed that domainspecific word embedding with the Bidirectional LSTM based deep model achieved a 93% f1-score while BERT achieved up to 96% f1-score on a combined balanced dataset from available hate speech datasets.
arXiv Detail & Related papers (2021-11-02T11:42:54Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - One to rule them all: Towards Joint Indic Language Hate Speech Detection [7.296361860015606]
We present a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection.
On the provided testing corpora, we achieve Macro F1 scores of 0.7996, 0.7748, 0.8651 for sub-task 1A and 0.6268, 0.5603 during the fine-grained classification of sub-task 1B.
arXiv Detail & Related papers (2021-09-28T13:30:00Z) - Hate Speech Detection in Roman Urdu [1.6436293069942314]
This study is the first to conduct a study for hate speech detection in Roman Urdu text.
We have scrapped more than 90,000 tweets and manually parsed them to identify 5,000 Roman Urdu tweets.
We have employed an iterative approach to develop guidelines and used them for generating the Hate Speech Roman Urdu 2020 corpus.
arXiv Detail & Related papers (2021-08-05T19:49:46Z) - Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media
during the COVID-19 Crisis [51.39895377836919]
COVID-19 has sparked racism and hate on social media targeted towards Asian communities.
We study the evolution and spread of anti-Asian hate speech through the lens of Twitter.
We create COVID-HATE, the largest dataset of anti-Asian hate and counterspeech spanning 14 months.
arXiv Detail & Related papers (2020-05-25T21:58:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.