HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech
Detection in Bangla
- URL: http://arxiv.org/abs/2112.01902v1
- Date: Fri, 3 Dec 2021 13:35:18 GMT
- Authors: Nauros Romim, Mosahed Ahmed, Md Saiful Islam, Arnab Sen Sharma,
Hriteshwar Talukder, Mohammad Ruhul Amin
- Abstract summary: In this paper, we present HS-BAN, a binary-class hate speech dataset in the Bangla language consisting of more than 50,000 labeled comments.
We explore traditional linguistic features and neural network-based methods to develop a benchmark system for hate speech detection.
Our benchmark shows that a Bi-LSTM model on top of the FastText informal word embedding achieved 86.78% F1-score.
- Score: 2.055204980188575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present HS-BAN, a binary-class hate speech (HS) dataset in
the Bangla language consisting of more than 50,000 labeled comments, of which
40.17% are hate speech and the rest are non-hate speech. While preparing the dataset, a strict
and detailed annotation guideline was followed to reduce human annotation bias.
The HS dataset was also preprocessed linguistically to extract the different types
of slang that people currently write using symbols, acronyms, or alternative
spellings. These slang words were further categorized into traditional and
non-traditional slang lists and included in the results of this paper. We
explored traditional linguistic features and neural network-based methods to
develop a benchmark system for hate speech detection for the Bangla language.
Our experimental results show that existing word embedding models trained with
informal texts perform better than those trained with formal text. Our
benchmark shows that a Bi-LSTM model on top of the FastText informal word
embedding achieved 86.78% F1-score. We will make the dataset available for
public use.
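The headline figure above is an F1-score, the harmonic mean of precision and recall. The following is a minimal sketch of how that metric is computed for a binary task like this one. The abstract does not say which averaging scheme the authors used, so this example assumes F1 on the hate class (label 1); the function name `f1_score` and the toy labels are illustrative and do not come from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (assumed here to be the hate label)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration: 1 = hate, 0 = non-hate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

With roughly 40% of the comments labeled as hate, the classes are moderately imbalanced, which is one reason F1 is often preferred over plain accuracy for this kind of benchmark.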
Related papers
- BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla [0.0]
We introduce BanTH, the first multi-label transliterated Bangla hate speech dataset comprising 37.3k samples.
The samples are sourced from YouTube comments, where each instance is labeled with one or more target groups.
Experiments reveal that our further pre-trained encoders achieve state-of-the-art performance on the BanTH dataset.
arXiv Detail & Related papers (2024-10-17T07:15:15Z)
- BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric [66.73705349465207]
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics.
We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
arXiv Detail & Related papers (2022-12-16T14:00:26Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Spread Love Not Hate: Undermining the Importance of Hateful Pre-training
for Hate Speech Detection [0.7874708385247353]
We study the effects of hateful pre-training on low resource hate speech classification tasks.
We evaluate different variations of tweet based BERT models pre-trained on hateful, non-hateful and mixed subsets of 40M tweet dataset.
We show that pre-training on non-hateful text from target domain provides similar or better results.
arXiv Detail & Related papers (2022-10-09T13:53:06Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate
Speech in Different Social Contexts [1.5483942282713241]
This paper introduces a large manually labeled dataset that includes Hate Speech in different social contexts.
The dataset includes more than 50,200 offensive comments crawled from online social networking sites.
In experiments, we found that a word embedding trained exclusively using 1.47 million comments consistently resulted in better modeling of HS detection.
arXiv Detail & Related papers (2022-06-01T10:10:15Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Ceasing hate withMoH: Hate Speech Detection in Hindi-English
Code-Switched Language [2.9926023796813728]
This work focuses on analyzing hate speech in Hindi-English code-switched language.
To contain the structure of data, we developed MoH or Map Only Hindi, which means "Love" in Hindi.
The MoH pipeline consists of language identification and Roman-to-Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
arXiv Detail & Related papers (2021-10-18T15:24:32Z)
- Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Classification Benchmarks for Under-resourced Bengali Language based on
Multichannel Convolutional-LSTM Network [3.0168410626760034]
We build the largest Bengali word embedding models to date based on 250 million articles, which we call BengFastText.
We incorporate word embeddings into a Multichannel Convolutional-LSTM network for predicting different types of hate speech, document classification, and sentiment analysis.
arXiv Detail & Related papers (2020-04-11T22:17:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.