Related papers: Hate Speech and Offensive Language Detection in Bengali

Hate Speech and Offensive Language Detection in Bengali

URL: http://arxiv.org/abs/2210.03479v1
Date: Fri, 7 Oct 2022 12:06:04 GMT
Title: Hate Speech and Offensive Language Detection in Bengali
Authors: Mithun Das, Somnath Banerjee, Punyajoy Saha, Animesh Mukherjee
Abstract summary: We develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets. We implement several baseline models for the classification of such hateful posts. We also explore the interlingual transfer mechanism to boost classification performance.
Score: 5.765076125746209
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Social media often serves as a breeding ground for various hateful and offensive content. Identifying such content on social media is crucial due to its impact on the race, gender, or religion in an unprejudiced society. However, while there is extensive research in hate speech detection in English, there is a gap in hateful content detection in low-resource languages like Bengali. Besides, a current trend on social media is the use of Romanized Bengali for regular interactions. To overcome the existing research's limitations, in this study, we develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets. We implement several baseline models for the classification of such hateful posts. We further explore the interlingual transfer mechanism to boost classification performance. Finally, we perform an in-depth error analysis by looking into the misclassified posts by the models. While training actual and Romanized datasets separately, we observe that XLM-Roberta performs the best. Further, we witness that on joint training and few-shot training, MuRIL outperforms other models by interpreting the semantic expressions better. We make our code and dataset public for others.

Related papers

A Federated Approach to Few-Shot Hate Speech Detection for Marginalized Communities [43.37824420609252]
Hate speech online remains an understudied issue for marginalized communities. In this paper, we aim to provide marginalized communities with a privacy-preserving tool to protect themselves from online hate speech.
arXiv Detail & Related papers (2024-12-06T11:00:05Z)
Analysis and Detection of Multilingual Hate Speech Using Transformer Based Deep Learning [7.332311991395427]
As the prevalence of hate speech increases online, the demand for automated detection as an NLP task is increasing. In this work, the proposed method is using transformer-based model to detect hate speech in social media, like twitter, Facebook, WhatsApp, Instagram, etc. The Gold standard datasets were collected from renowned researcher Zeerak Talat, Sara Tonelli, Melanie Siegel, and Rezaul Karim. The success rate of the proposed model for hate speech detection is higher than the existing baseline and state-of-the-art models with accuracy in Bengali dataset is 89%, in English: 91%, in German
arXiv Detail & Related papers (2024-01-19T20:40:23Z)
Lost in Translation, Found in Spans: Identifying Claims in Multilingual Social Media [40.26888469822391]
Claim span identification (CSI) is an important step in fact-checking pipelines. Despite its importance to journalists and human fact-checkers, it remains a severely understudied problem. We create a novel dataset, X-CLAIM, consisting of 7K real-world claims collected from numerous social media platforms in five Indian languages and English.
arXiv Detail & Related papers (2023-10-27T15:28:12Z)
Understanding writing style in social media with a supervised contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation. We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 106 authored texts. Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
arXiv Detail & Related papers (2023-10-17T09:01:17Z)
Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis [44.17106903728264]
Most hate speech datasets neglect the cultural diversity within a single language. To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset. Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate of 26%.
arXiv Detail & Related papers (2023-08-31T13:14:47Z)
Analyzing Norm Violations in Live-Stream Chat [49.120561596550395]
We study the first NLP study dedicated to detecting norm violations in conversations on live-streaming platforms. We define norm violation categories in live-stream chats and annotate 4,583 moderated comments from Twitch. Our results show that appropriate contextual information can boost moderation performance by 35%.
arXiv Detail & Related papers (2023-05-18T05:58:27Z)
Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems. This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
Bangla hate speech detection on social media using attention-based recurrent neural network [2.1349209400003932]
This article proposed encoder decoder based machine learning model, a popular tool in NLP, to classify user's Bengali comments on Facebook pages. A dataset of 7,425 Bengali comments, consisting of seven distinct categories of hate speeches, was used to train and evaluate our model. Among the three encoder decoder algorithms, the attention-based decoder obtained the best accuracy (77%)
arXiv Detail & Related papers (2022-03-31T03:31:53Z)
Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages. We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language. We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
Factorization of Fact-Checks for Low Resource Indian Languages [44.94080515860928]
We introduce FactDRIL: the first large scale multilingual Fact-checking dataset for Regional Indian languages. Our dataset consists of 9,058 samples belonging to English, 5,155 samples to Hindi and remaining 8,222 samples are distributed across various regional languages. We expect this dataset will be a valuable resource and serve as a starting point to fight proliferation of fake news in low resource languages.
arXiv Detail & Related papers (2021-02-23T16:47:41Z)
Evaluation of Deep Learning Models for Hostility Detection in Hindi Text [2.572404739180802]
We present approaches for hostile text detection in the Hindi language. The proposed approaches are evaluated on the Constraint@AAAI 2021 Hindi hostility detection dataset. We evaluate a host of deep learning approaches based on CNN, LSTM, and BERT for this multi-label classification problem.
arXiv Detail & Related papers (2021-01-11T19:10:57Z)
Hate Speech detection in the Bengali language: A dataset and its baseline evaluation [0.8793721044482612]
This paper presents a new dataset of 30,000 user comments tagged by crowd sourcing and varified by experts. All the comments are collected from YouTube and Facebook comment section and classified into seven categories. A total of 50 annotators annotated each comment three times and the majority vote was taken as the final annotation.
arXiv Detail & Related papers (2020-12-17T15:53:54Z)
Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network [3.0168410626760034]
We build the largest Bengali word embedding models to date based on 250 million articles, which we call BengFastText. We incorporate word embeddings into a Multichannel Convolutional-LSTM network for predicting different types of hate speech, document classification, and sentiment analysis.
arXiv Detail & Related papers (2020-04-11T22:17:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.