BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate
Speech in Different Social Contexts
- URL: http://arxiv.org/abs/2206.00372v1
- Date: Wed, 1 Jun 2022 10:10:15 GMT
- Authors: Nauros Romim, Mosahed Ahmed, Md. Saiful Islam, Arnab Sen Sharma,
Hriteshwar Talukder, Mohammad Ruhul Amin
- Abstract summary: This paper introduces a large manually labeled dataset that includes Hate Speech in different social contexts.
The dataset includes more than 50,200 offensive comments crawled from online social networking sites.
In experiments, we found that a word embedding trained exclusively using 1.47 million comments consistently resulted in better modeling of HS detection.
- Score: 1.5483942282713241
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social media platforms and online streaming services have spawned a new breed
of Hate Speech (HS). Due to the massive amount of user-generated content on
these sites, modern machine learning techniques are found to be feasible and
cost-effective to tackle this problem. However, linguistically diverse datasets
covering different social contexts in which offensive language is typically
used are required to train generalizable models. In this paper, we identify the
shortcomings of existing Bangla HS datasets and introduce a large manually
labeled dataset BD-SHS that includes HS in different social contexts. The
labeling criteria were prepared following a hierarchical annotation process,
which is the first of its kind in Bangla HS to the best of our knowledge. The
dataset includes more than 50,200 offensive comments crawled from online social
networking sites and is at least 60% larger than any existing Bangla HS
dataset. We present benchmark results on our dataset by training different
NLP models, the best of which achieves an F1-score of 91.0%. In our
experiments, we found that a word embedding trained exclusively using 1.47
million comments from social media and streaming sites consistently resulted in
better modeling of HS detection in comparison to other pre-trained embeddings.
Our dataset and all accompanying code are publicly available at
github.com/naurosromim/hate-speech-dataset-for-Bengali-social-media
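The F1-score quoted in the abstract combines precision and recall into one number. The abstract does not say which averaging scheme was used, so the following is only a minimal sketch that assumes binary F1 on the positive (hate speech) class; the function name and the toy labels are hypothetical and are not taken from the paper or its repository.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (assumed here to mean hate speech)."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration: 1 = hate speech, 0 = not hate speech.
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 0, 1, 0, 1, 1]
print(f1_score(gold, pred))
```

With these toy labels there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75.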
Related papers
- Offensive Language Identification in Transliterated and Code-Mixed
Bangla [29.30985521838655]
In this paper, we explore offensive language identification in texts with transliterations and code-mixing.
We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments.
We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset.
arXiv Detail & Related papers (2023-11-25T13:27:22Z)
- Understanding writing style in social media with a supervised
contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 10^6 authored texts.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
arXiv Detail & Related papers (2023-10-17T09:01:17Z)
- KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased
Speech in Real-World Online Services [5.03606775899383]
"KoMultiText" is a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform.
Our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics.
Our work can provide solutions for real-world hate speech and bias mitigation, contributing directly to the improvement of online community health.
arXiv Detail & Related papers (2023-10-06T15:19:39Z)
- Harnessing the Power of Text-image Contrastive Models for Automatic
Detection of Online Misinformation [50.46219766161111]
We develop a self-learning model to explore contrastive learning in the domain of misinformation identification.
Our model shows the superior performance of non-matched image-text pair detection when the training data is insufficient.
arXiv Detail & Related papers (2023-04-19T02:53:59Z)
- BERT-based Ensemble Approaches for Hate Speech Detection [1.8734449181723825]
This paper focuses on classifying hate speech in social media using multiple deep models.
We evaluated with several ensemble techniques, including soft voting, maximum value, hard voting and stacking.
Experiments showed good results, especially for the ensemble models, where stacking gave an F1 score of 97% on the Davidson dataset and aggregating ensembles gave 77% on the DHO dataset.
arXiv Detail & Related papers (2022-09-14T09:08:24Z)
- HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech
Detection in Bangla [2.055204980188575]
In this paper, we present HS-BAN, a binary class hate speech dataset in Bangla language consisting of more than 50,000 labeled comments.
We explore traditional linguistic features and neural network-based methods to develop a benchmark system for hate speech detection.
Our benchmark shows that a Bi-LSTM model on top of the FastText informal word embedding achieved 86.78% F1-score.
arXiv Detail & Related papers (2021-12-03T13:35:18Z)
- Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems brought by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)
- Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- hBert + BiasCorp -- Fighting Racism on the Web [58.768804813646334]
We are releasing BiasCorp, a dataset containing 139,090 comments and news segments from three specific sources - Fox News, BreitbartNews and YouTube.
In this work, we present hBERT, where we modify certain layers of the pretrained BERT model with the new Hopfield Layer.
We are also releasing a JavaScript library and a Chrome Extension Application, to help developers make use of our trained model in web applications.
arXiv Detail & Related papers (2021-04-06T02:17:20Z)
- Hate Speech detection in the Bengali language: A dataset and its
baseline evaluation [0.8793721044482612]
This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts.
All the comments are collected from YouTube and Facebook comment sections and classified into seven categories.
A total of 50 annotators annotated each comment three times and the majority vote was taken as the final annotation.
arXiv Detail & Related papers (2020-12-17T15:53:54Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)