Related papers: Cross-Linguistic Offensive Language Detection: BERT-Based Analysis of Bengali, Assamese, & Bodo Conversational Hateful Content from Social Media

URL: http://arxiv.org/abs/2312.10528v1
Date: Sat, 16 Dec 2023 19:59:07 GMT
Title: Cross-Linguistic Offensive Language Detection: BERT-Based Analysis of Bengali, Assamese, & Bodo Conversational Hateful Content from Social Media
Authors: Jhuma Kabir Mim, Mourad Oussalah, Akash Singhal
Abstract summary: This article delves into the comprehensive results and key revelations from the HASOC-2023 offensive language identification task. The primary emphasis is placed on the meticulous detection of hate speech within the linguistic domains of Bengali, Assamese, and Bodo. In this work, we used BERT models, including XLM-RoBERTa, L3-cube, IndicBERT, BanglaBERT, and BanglaHateBERT.
Score: 0.8287206589886881
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In today's age, social media reigns as the paramount communication platform, providing individuals with the avenue to express their conjectures, intellectual propositions, and reflections. Unfortunately, this freedom often comes with a downside as it facilitates the widespread proliferation of hate speech and offensive content, leaving a deleterious impact on our world. Thus, it becomes essential to discern and eradicate such offensive material from the realm of social media. This article delves into the comprehensive results and key revelations from the HASOC-2023 offensive language identification task. The primary emphasis is placed on the meticulous detection of hate speech within the linguistic domains of Bengali, Assamese, and Bodo, forming the framework for Task 4: Annihilate Hates. In this work, we used BERT models, including XLM-RoBERTa, L3-cube, IndicBERT, BanglaBERT, and BanglaHateBERT. The research outcomes were promising and showed that XLM-RoBERTa-large performed better than monolingual models in most cases. Our team 'TeamBD' ranked 3rd for Task 4 - Assamese, and 5th for Bengali.

Related papers

BIDWESH: A Bangla Regional Based Hate Speech Detection Dataset [0.0]
This study introduces BIDWESH, the first multi-dialectal Bangla hate speech dataset. It was constructed by translating and annotating 9,183 instances from the BD-SHS corpus into three major regional dialects. The resulting dataset provides a linguistically rich, balanced, and inclusive resource for advancing hate speech detection in Bangla.
arXiv Detail & Related papers (2025-07-22T02:53:48Z)
A Federated Approach to Few-Shot Hate Speech Detection for Marginalized Communities [43.37824420609252]
Hate speech online remains an understudied issue for marginalized communities. In this paper, we aim to provide marginalized communities with a privacy-preserving tool to protect themselves from online hate speech.
arXiv Detail & Related papers (2024-12-06T11:00:05Z)
Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue [67.09698638709065]
We propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE. In particular, we first propose a lexicon-guided utterance sentiment inference module, where an utterance sentiment refinement strategy is devised. We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip.
arXiv Detail & Related papers (2024-02-06T03:14:46Z)
Analysis and Detection of Multilingual Hate Speech Using Transformer Based Deep Learning [7.332311991395427]
As the prevalence of hate speech increases online, the demand for automated detection as an NLP task is increasing. In this work, the proposed method uses a transformer-based model to detect hate speech on social media platforms such as Twitter, Facebook, WhatsApp, and Instagram. The gold-standard datasets were collected from the researchers Zeerak Talat, Sara Tonelli, Melanie Siegel, and Rezaul Karim. The success rate of the proposed model for hate speech detection is higher than the existing baseline and state-of-the-art models, with accuracy on the Bengali dataset of 89%, in English: 91%, in German
arXiv Detail & Related papers (2024-01-19T20:40:23Z)
Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages [0.6526824510982802]
This work delves into the domain of hate speech detection, placing specific emphasis on three low-resource Indian languages: Bengali, Assamese, and Gujarati. The challenge is framed as a text classification task, aimed at discerning whether a tweet contains offensive or non-offensive content. We fine-tuned pre-trained BERT and SBERT models to evaluate their effectiveness in identifying hate speech.
arXiv Detail & Related papers (2023-10-03T17:53:09Z)
Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis [44.17106903728264]
Most hate speech datasets neglect the cultural diversity within a single language. To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset. Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate of 26%.
arXiv Detail & Related papers (2023-08-31T13:14:47Z)
Hate Speech and Offensive Language Detection in Bengali [5.765076125746209]
We develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets. We implement several baseline models for the classification of such hateful posts. We also explore the interlingual transfer mechanism to boost classification performance.
arXiv Detail & Related papers (2022-10-07T12:06:04Z)
Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language. We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening. For both subtasks, an m-BERT-based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages. We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language. We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances. We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID. The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
DeepHateExplainer: Explainable Hate Speech Detection in Under-resourced Bengali Language [1.2246649738388389]
We propose an explainable approach for hate speech detection from the under-resourced Bengali language. In our approach, Bengali texts are first comprehensively preprocessed, before classifying them into political, personal, geopolitical, and religious hates. Evaluations against machine learning (linear and tree-based models) and deep neural networks (i.e., CNN, Bi-LSTM, and Conv-LSTM with word embeddings) baselines yield F1 scores of 84%, 90%, 88%, and 88%, for political, personal, geopolitical, and religious hates, respectively.
arXiv Detail & Related papers (2020-12-28T16:46:03Z)
Hate Speech detection in the Bengali language: A dataset and its baseline evaluation [0.8793721044482612]
This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts. All the comments were collected from YouTube and Facebook comment sections and classified into seven categories. A total of 50 annotators annotated each comment three times and the majority vote was taken as the final annotation.
arXiv Detail & Related papers (2020-12-17T15:53:54Z)
Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media during the COVID-19 Crisis [51.39895377836919]
COVID-19 has sparked racism and hate on social media targeted towards Asian communities. We study the evolution and spread of anti-Asian hate speech through the lens of Twitter. We create COVID-HATE, the largest dataset of anti-Asian hate and counterspeech spanning 14 months.
arXiv Detail & Related papers (2020-05-25T21:58:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.