Cross-Linguistic Offensive Language Detection: BERT-Based Analysis of
Bengali, Assamese, & Bodo Conversational Hateful Content from Social Media
- URL: http://arxiv.org/abs/2312.10528v1
- Date: Sat, 16 Dec 2023 19:59:07 GMT
- Authors: Jhuma Kabir Mim, Mourad Oussalah, Akash Singhal
- Abstract summary: This article presents the results and key findings from the HASOC-2023 offensive language identification task.
The primary emphasis is placed on the detection of hate speech in Bengali, Assamese, and Bodo.
In this work, we used BERT-based models, including XLM-RoBERTa, L3-cube, IndicBERT, BanglaBERT, and BanglaHateBERT.
- Score: 0.8287206589886881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In today's age, social media reigns as the paramount communication platform,
providing individuals with the avenue to express their conjectures,
intellectual propositions, and reflections. Unfortunately, this freedom often
comes with a downside as it facilitates the widespread proliferation of hate
speech and offensive content, leaving a deleterious impact on our world. Thus,
it becomes essential to discern and eradicate such offensive material from the
realm of social media. This article presents the results and key findings
from the HASOC-2023 offensive language identification task.
The primary emphasis is placed on the meticulous detection of hate speech
within the linguistic domains of Bengali, Assamese, and Bodo, forming the
framework for Task 4: Annihilate Hates. In this work, we used BERT-based models,
including XLM-RoBERTa, L3-cube, IndicBERT, BanglaBERT, and BanglaHateBERT. The
research outcomes were promising and showed that XLM-RoBERTa-large performed
better than monolingual models in most cases. Our team 'TeamBD' ranked
3rd for Task 4 - Assamese and 5th for Bengali.
Related papers
- Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue [67.09698638709065]
We propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE.
In particular, we first propose a lexicon-guided utterance sentiment inference module, where an utterance sentiment refinement strategy is devised.
We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip.
arXiv Detail & Related papers (2024-02-06T03:14:46Z)
- Analysis and Detection of Multilingual Hate Speech Using Transformer Based Deep Learning [7.332311991395427]
As the prevalence of hate speech increases online, the demand for automated detection as an NLP task is increasing.
In this work, the proposed method uses a transformer-based model to detect hate speech on social media platforms such as Twitter, Facebook, WhatsApp, and Instagram.
The gold-standard datasets were collected from the researchers Zeerak Talat, Sara Tonelli, Melanie Siegel, and Rezaul Karim.
The proposed model outperforms the existing baseline and state-of-the-art models for hate speech detection, with an accuracy of 89% on the Bengali dataset, 91% in English, in German
arXiv Detail & Related papers (2024-01-19T20:40:23Z)
- Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages [0.6526824510982802]
This work delves into the domain of hate speech detection, placing specific emphasis on three low-resource Indian languages: Bengali, Assamese, and Gujarati.
The challenge is framed as a text classification task, aimed at discerning whether a tweet contains offensive or non-offensive content.
We fine-tuned pre-trained BERT and SBERT models to evaluate their effectiveness in identifying hate speech.
arXiv Detail & Related papers (2023-10-03T17:53:09Z)
- Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis [44.17106903728264]
Most hate speech datasets neglect the cultural diversity within a single language.
To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset.
Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate of 26%.
arXiv Detail & Related papers (2023-08-31T13:14:47Z)
- Hate Speech and Offensive Language Detection in Bengali [5.765076125746209]
We develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets.
We implement several baseline models for the classification of such hateful posts.
We also explore the interlingual transfer mechanism to boost classification performance.
arXiv Detail & Related papers (2022-10-07T12:06:04Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, an m-BERT-based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- DeepHateExplainer: Explainable Hate Speech Detection in Under-resourced Bengali Language [1.2246649738388389]
We propose an explainable approach for hate speech detection from the under-resourced Bengali language.
In our approach, Bengali texts are first comprehensively preprocessed, before classifying them into political, personal, geopolitical, and religious hates.
Evaluations against machine learning (linear and tree-based models) and deep neural networks (i.e., CNN, Bi-LSTM, and Conv-LSTM with word embeddings) baselines yield F1 scores of 84%, 90%, 88%, and 88%, for political, personal, geopolitical, and religious hates, respectively.
arXiv Detail & Related papers (2020-12-28T16:46:03Z)
- Hate Speech detection in the Bengali language: A dataset and its baseline evaluation [0.8793721044482612]
This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts.
All the comments are collected from YouTube and Facebook comment sections and classified into seven categories.
A total of 50 annotators annotated each comment three times and the majority vote was taken as the final annotation.
arXiv Detail & Related papers (2020-12-17T15:53:54Z)
- Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media during the COVID-19 Crisis [51.39895377836919]
COVID-19 has sparked racism and hate on social media targeted towards Asian communities.
We study the evolution and spread of anti-Asian hate speech through the lens of Twitter.
We create COVID-HATE, the largest dataset of anti-Asian hate and counterspeech spanning 14 months.
arXiv Detail & Related papers (2020-05-25T21:58:09Z)