Hate Speech detection in the Bengali language: A dataset and its
baseline evaluation
- URL: http://arxiv.org/abs/2012.09686v1
- Date: Thu, 17 Dec 2020 15:53:54 GMT
- Title: Hate Speech detection in the Bengali language: A dataset and its
baseline evaluation
- Authors: Nauros Romim, Mosahed Ahmed, Hriteshwar Talukder, Md Saiful Islam
- Abstract summary: This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts.
All the comments are collected from YouTube and Facebook comment sections and classified into seven categories.
A total of 50 annotators annotated each comment three times and the majority vote was taken as the final annotation.
- Score: 0.8793721044482612
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social media sites such as YouTube and Facebook have become an integral part
of everyone's life and in the last few years, hate speech in the social media
comment section has increased rapidly. Detection of hate speech on social media
websites faces a variety of challenges, including small imbalanced data sets,
finding an appropriate model, and the choice of feature analysis
method. Furthermore, this problem is more severe for the Bengali-speaking
community due to the lack of gold-standard labelled datasets. This paper
presents a new dataset of 30,000 user comments tagged by crowdsourcing and
verified by experts. All the comments are collected from YouTube and Facebook
comment sections and classified into seven categories: sports, entertainment,
religion, politics, crime, celebrity and TikTok & meme. A total of 50
annotators annotated each comment three times and the majority vote was taken
as the final annotation. We have also conducted baseline experiments
and evaluated several deep learning models, along with extensive pre-trained
Bengali word embeddings such as Word2Vec, FastText and BengFastText, on this
dataset to facilitate future research opportunities. The experiments illustrated that
although all deep learning models performed well, SVM achieved the best result
with 87.5% accuracy. Our core contribution is to make this benchmark dataset
available and accessible to facilitate further research in the field of
Bengali hate speech detection.
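The annotation step described in the abstract (three annotations per comment, majority vote as the final label) can be illustrated with a minimal sketch. This is not the authors' code: the function name majority_vote, the label strings "hate" and "not_hate", and the comment IDs are assumptions made for illustration, and returning None on a tie is a design choice of this sketch rather than something the abstract specifies.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the label chosen by a strict majority of annotators.

    Returns None when no label is chosen by more than half of the
    annotators (hypothetical tie handling, not taken from the paper).
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    # A strict majority means more than half of all annotations.
    return label if votes * 2 > len(annotations) else None


# Hypothetical example: three annotations per comment, as in the abstract.
raw_annotations = {
    "comment_1": ["hate", "not_hate", "hate"],
    "comment_2": ["not_hate", "not_hate", "not_hate"],
}
final_labels = {cid: majority_vote(votes) for cid, votes in raw_annotations.items()}
```

With three annotations and two possible labels a strict majority always exists, so the None branch only matters if more labels or an even number of annotations are used.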
Related papers
- Hate Speech Detection and Classification in Amharic Text with Deep Learning [4.834669033093363]
We develop Amharic hate speech data and SBi-LSTM deep learning model that can detect and classify text into four categories of hate speech.
We have annotated 5k Amharic social media post and comment data into four categories.
The model achieves a 94.8 F1-score performance.
arXiv Detail & Related papers (2024-08-07T15:46:45Z)
- The Evolution of Language in Social Media Comments [37.69303106863453]
This study investigates the linguistic characteristics of user comments over 34 years, focusing on their complexity and temporal shifts.
We utilize a dataset of approximately 300 million English comments from eight diverse platforms and topics.
Our findings reveal consistent patterns of complexity across social media platforms and topics, characterized by a nearly universal reduction in text length, diminished lexical richness, but decreased repetitiveness.
arXiv Detail & Related papers (2024-06-17T12:03:30Z)
- Analysis and Detection of Multilingual Hate Speech Using Transformer Based Deep Learning [7.332311991395427]
As the prevalence of hate speech increases online, the demand for automated detection as an NLP task is increasing.
In this work, the proposed method uses a transformer-based model to detect hate speech on social media such as Twitter, Facebook, WhatsApp, and Instagram.
The gold-standard datasets were collected from renowned researchers Zeerak Talat, Sara Tonelli, Melanie Siegel, and Rezaul Karim.
The success rate of the proposed model for hate speech detection is higher than the existing baseline and state-of-the-art models, with accuracy on the Bengali dataset of 89%, in English: 91%, in German
arXiv Detail & Related papers (2024-01-19T20:40:23Z)
- Understanding writing style in social media with a supervised contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 10^6 authored texts.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
arXiv Detail & Related papers (2023-10-17T09:01:17Z)
- Analyzing Norm Violations in Live-Stream Chat [49.120561596550395]
We conduct the first NLP study dedicated to detecting norm violations in conversations on live-streaming platforms.
We define norm violation categories in live-stream chats and annotate 4,583 moderated comments from Twitch.
Our results show that appropriate contextual information can boost moderation performance by 35%.
arXiv Detail & Related papers (2023-05-18T05:58:27Z)
- Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages [35.185808055004344]
Most hate speech datasets so far focus on English-language content.
More data is needed, but annotating hateful content is expensive, time-consuming and potentially harmful to annotators.
We explore data-efficient strategies for expanding hate speech detection into under-resourced languages.
arXiv Detail & Related papers (2022-10-20T15:49:00Z)
- Hate Speech and Offensive Language Detection in Bengali [5.765076125746209]
We develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets.
We implement several baseline models for the classification of such hateful posts.
We also explore the interlingual transfer mechanism to boost classification performance.
arXiv Detail & Related papers (2022-10-07T12:06:04Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- WordBias: An Interactive Visual Tool for Discovering Intersectional Biases Encoded in Word Embeddings [39.87681037622605]
We present WordBias, an interactive visual tool designed to explore biases against intersectional groups encoded in word embeddings.
Given a pretrained static word embedding, WordBias computes the association of each word along different groups based on race, age, etc.
arXiv Detail & Related papers (2021-03-05T11:04:35Z)
- Bangla Text Dataset and Exploratory Analysis for Online Harassment Detection [0.0]
The data made accessible in this article has been gathered and labelled from people's comments on public Facebook posts by celebrities, government officials, and athletes.
The dataset is compiled with the aim of developing the ability of machines to differentiate whether a comment is a bully expression or not.
arXiv Detail & Related papers (2021-02-04T08:35:18Z)
- Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network [3.0168410626760034]
We build the largest Bengali word embedding models to date based on 250 million articles, which we call BengFastText.
We incorporate word embeddings into a Multichannel Convolutional-LSTM network for predicting different types of hate speech, document classification, and sentiment analysis.
arXiv Detail & Related papers (2020-04-11T22:17:04Z)