Bangla Text Dataset and Exploratory Analysis for Online Harassment
Detection
- URL: http://arxiv.org/abs/2102.02478v1
- Date: Thu, 4 Feb 2021 08:35:18 GMT
- Title: Bangla Text Dataset and Exploratory Analysis for Online Harassment
Detection
- Authors: Md Faisal Ahmed, Zalish Mahmud, Zarin Tasnim Biash, Ahmed Ann Noor
Ryen, Arman Hossain, Faisal Bin Ashraf
- Abstract summary: The data made accessible in this article was gathered and labeled from comments on public Facebook posts by celebrities, government officials, and athletes.
The dataset is compiled with the aim of enabling machines to determine whether a comment is a bullying expression or not.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bangla is the seventh most spoken language in the world, and its
use online has increased in recent times. It has therefore become
important to analyze Bangla text data to keep online spaces safe and free of
harassment. The data made accessible in this article was gathered and labeled
from comments on public Facebook posts by celebrities, government officials,
and athletes. A total of 44,001 comments were collected. The dataset is
compiled with the aim of enabling machines to determine, with the help of
Natural Language Processing, whether a comment is a bullying expression and,
if it is inappropriate, to what extent. The comments are labeled with
different categories of harassment. Exploratory analysis from different
perspectives is also included in this paper to give a detailed overview.
Because categorized collections of Bengali-language comments are scarce, this
dataset can play a significant role in research on detecting bullying words,
identifying inappropriate comments, detecting different categories of
bullying in Bengali, etc. The dataset is publicly available at
https://data.mendeley.com/datasets/9xjx8twk8p.
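As a rough illustration of the kind of exploratory analysis the abstract mentions, the sketch below counts comments per harassment category and collapses them into a bully versus not-bully view. It is only a sketch under stated assumptions: the column names (`comment`, `label`), the label values, and the placeholder rows are hypothetical and are not taken from the published dataset, whose actual file layout should be checked before adapting this.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the real file. The column names ("comment",
# "label"), the label values, and the rows are placeholders invented for
# illustration; they are not taken from the published dataset.
SAMPLE_CSV = """comment,label
placeholder comment 1,not bully
placeholder comment 2,troll
placeholder comment 3,not bully
placeholder comment 4,threat
"""


def label_distribution(csv_file, label_column="label"):
    """Count how many comments carry each label."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


def binary_view(counts, neutral_label="not bully"):
    """Collapse all harassment categories into bully vs. not bully."""
    neutral = counts.get(neutral_label, 0)
    return {"not bully": neutral, "bully": sum(counts.values()) - neutral}


counts = label_distribution(io.StringIO(SAMPLE_CSV))
print(counts.most_common())
print(binary_view(counts))
```

To run this against a downloaded file, the `io.StringIO(SAMPLE_CSV)` argument could be swapped for an opened file handle, with `label_column` and `neutral_label` adjusted to whatever names the real file uses.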
Related papers
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap in low-resource languages, especially the Swahili Language.
Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
- What Evidence Do Language Models Find Convincing? [94.90663008214918]
We build a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts.
We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions.
Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important.
arXiv Detail & Related papers (2024-02-19T02:15:34Z)
- Textual Toxicity in Social Media: Understanding the Bangla Toxic
Language Expressed in Facebook Comment [0.6798775532273751]
The toxic language/script used by the Bengali community as cyberbullying, hate speech and moral policing became major trends in social media culture in Bangladesh and West Bengal.
It is assumed that this analysis will reinforce the detection of Bangla's toxic language used in social media and thus cure this virtual disease.
arXiv Detail & Related papers (2023-12-09T05:04:34Z)
- Hate Speech and Offensive Language Detection in Bengali [5.765076125746209]
We develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets.
We implement several baseline models for the classification of such hateful posts.
We also explore the interlingual transfer mechanism to boost classification performance.
arXiv Detail & Related papers (2022-10-07T12:06:04Z)
- BanglaSarc: A Dataset for Sarcasm Detection [0.3914676152740142]
Sarcasm is a positive statement or remark with an underlying negative motivation that is extensively employed in today's social media platforms.
Sarcasm detection in English has improved significantly over the past several years; however, the situation for Bangla sarcasm detection remains unchanged.
This article proposes BanglaSarc, a dataset constructed specifically for sarcasm detection in Bangla textual data.
arXiv Detail & Related papers (2022-09-27T15:28:21Z)
- Beyond Plain Toxic: Detection of Inappropriate Statements on Flammable
Topics for the Russian Language [76.58220021791955]
We present two text collections labelled according to a binary notion of inappropriateness and a multinomial notion of sensitive topic.
To objectivise the notion of inappropriateness, we define it in a data-driven way through crowdsourcing.
arXiv Detail & Related papers (2022-03-04T15:59:06Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Factorization of Fact-Checks for Low Resource Indian Languages [44.94080515860928]
We introduce FactDRIL: the first large scale multilingual Fact-checking dataset for Regional Indian languages.
Our dataset consists of 9,058 samples in English, 5,155 samples in Hindi, and the remaining 8,222 samples distributed across various regional languages.
We expect this dataset will be a valuable resource and serve as a starting point to fight proliferation of fake news in low resource languages.
arXiv Detail & Related papers (2021-02-23T16:47:41Z)
- Hate Speech detection in the Bengali language: A dataset and its
baseline evaluation [0.8793721044482612]
This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts.
All the comments are collected from YouTube and Facebook comment sections and classified into seven categories.
A total of 50 annotators annotated each comment three times and the majority vote was taken as the final annotation.
arXiv Detail & Related papers (2020-12-17T15:53:54Z)
- Trawling for Trolling: A Dataset [56.1778095945542]
We present a dataset that models trolling as a subcategory of offensive content.
The dataset has 12,490 samples, split across 5 classes: Normal, Profanity, Trolling, Derogatory and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z)
- Creating a Multimodal Dataset of Images and Text to Study Abusive
Language [2.2688530041645856]
CREENDER is an annotation tool that has been used in school classes to create a multimodal dataset of images and abusive comments.
The corpus, with Italian comments, has been analysed from different perspectives to investigate whether the subject of the images plays a role in triggering a comment.
We find that users judge the same images in different ways, although the presence of a person in the picture increases the probability of receiving an offensive comment.
arXiv Detail & Related papers (2020-05-05T14:31:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.