Detecting Racist Text in Bengali: An Ensemble Deep Learning Framework
- URL: http://arxiv.org/abs/2401.16748v1
- Date: Tue, 30 Jan 2024 04:56:55 GMT
- Title: Detecting Racist Text in Bengali: An Ensemble Deep Learning Framework
- Authors: S. S. Saruar, Nusrat, Sadia
- Abstract summary: Racism is an alarming phenomenon in our country as well as all over the world.
We have built a novel dataset in the Bengali language.
We detect racist text with an accuracy of 87.94% using an ensemble approach.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Racism is an alarming phenomenon in our country as well as all over the
world. Every day we come across racist comments in our daily life and
virtual life. However, we can eradicate this racism from virtual life (such as
social media). In this paper, we try to detect such racist comments
with NLP and deep learning techniques. We built a novel dataset in the
Bengali language, annotated it, and validated the labels. Using deep
learning methods, we detect racist text with an accuracy of 87.94% using an
ensemble approach. We applied RNN and LSTM models using BERT embeddings;
however, the MCNN-LSTM model performed best among all these models. Finally,
we used an ensemble approach to combine the results of all the models and
increase overall performance.
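The abstract does not say how the model results are combined. As a minimal sketch, assuming soft voting (averaging each model's predicted probability that a comment is racist and thresholding the mean), the idea could look like the following. The function name `soft_vote`, the model keys, the 0.5 threshold, and the probabilities are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from statistics import mean


def soft_vote(prob_by_model, threshold=0.5):
    """Average each model's P(racist) per comment and threshold the mean.

    Assumption: soft voting. The paper's actual ensemble rule is not
    stated in the abstract.
    """
    lengths = {len(scores) for scores in prob_by_model.values()}
    if len(lengths) != 1:
        raise ValueError("every model must score the same comments")
    # zip(*...) groups the scores of all models for the same comment.
    averaged = [mean(scores) for scores in zip(*prob_by_model.values())]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels


# Hypothetical per-model probabilities for three comments (illustrative only).
example_probs = {
    "rnn_bert": [0.9, 0.4, 0.2],
    "lstm_bert": [0.7, 0.6, 0.1],
    "mcnn_lstm": [0.8, 0.2, 0.3],
}
averaged, labels = soft_vote(example_probs)
print(labels)
```

With these made-up scores, only the first comment has a mean probability above the threshold. Hard (majority) voting or weighting the stronger model more heavily would be equally plausible alternatives.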
Related papers
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z)
- BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts [1.5483942282713241]
This paper introduces a large manually labeled dataset that includes hate speech in different social contexts.
The dataset includes more than 50,200 offensive comments crawled from online social networking sites.
In experiments, we found that a word embedding trained exclusively using 1.47 million comments consistently resulted in better modeling of HS detection.
arXiv Detail & Related papers (2022-06-01T10:10:15Z)
- Multimodal Hate Speech Detection from Bengali Memes and Texts [0.6709991492637819]
This paper is about hate speech detection from multimodal Bengali memes and texts.
We train several neural networks to analyze textual and visual information for hate speech detection.
Our study suggests that memes are moderately useful for hate speech detection in Bengali, but none of the multimodal models outperform unimodal models.
arXiv Detail & Related papers (2022-04-19T11:15:25Z)
- Virtual Data Augmentation: A Robust and General Framework for Fine-tuning Pre-trained Models [51.46732511844122]
Powerful pre-trained language models (PLMs) can be fooled by small perturbations or intentional attacks.
We present Virtual Data Augmentation (VDA), a general framework for robustly fine-tuning PLMs.
Our approach is able to improve the robustness of PLMs and alleviate the performance degradation under adversarial attacks.
arXiv Detail & Related papers (2021-09-13T09:15:28Z)
- SN Computer Science: Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts [2.0305676256390934]
This study presents extensive experiments using multiple deep learning and transfer learning models to detect offensive content on YouTube.
We propose a novel and flexible approach of selective translation and transliteration techniques to reap better results from fine-tuning and ensembling multilingual transformer networks.
The proposed ULMFiT and mBERTBiLSTM models yielded good results and are promising for effective offensive speech identification in low-resourced languages.
arXiv Detail & Related papers (2021-08-24T20:23:30Z)
- KLUE: Korean Language Understanding Evaluation [43.94952771238633]
We introduce the Korean Language Understanding Evaluation (KLUE) benchmark.
KLUE is a collection of 8 Korean natural language understanding (NLU) tasks.
We build all of the tasks from scratch from diverse source corpora while respecting copyrights.
arXiv Detail & Related papers (2021-05-20T11:40:30Z)
- hBert + BiasCorp -- Fighting Racism on the Web [58.768804813646334]
We are releasing BiasCorp, a dataset containing 139,090 comments and news segments from three specific sources: Fox News, BreitbartNews and YouTube.
In this work, we present hBERT, where we modify certain layers of the pretrained BERT model with the new Hopfield Layer.
We are also releasing a JavaScript library and a Chrome Extension Application, to help developers make use of our trained model in web applications.
arXiv Detail & Related papers (2021-04-06T02:17:20Z)
- Evaluation of Deep Learning Models for Hostility Detection in Hindi Text [2.572404739180802]
We present approaches for hostile text detection in the Hindi language.
The proposed approaches are evaluated on the Constraint@AAAI 2021 Hindi hostility detection dataset.
We evaluate a host of deep learning approaches based on CNN, LSTM, and BERT for this multi-label classification problem.
arXiv Detail & Related papers (2021-01-11T19:10:57Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Detecting White Supremacist Hate Speech using Domain Specific Word Embedding with Deep Learning and BERT [0.0]
White supremacist hate speech is among the most recently observed forms of harmful content on social media.
This research investigates the viability of automatically detecting white supremacist hate speech on Twitter by using deep learning and natural language processing techniques.
arXiv Detail & Related papers (2020-10-01T12:44:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.