Related papers: BengaliSent140: A Large-Scale Bengali Binary Sentiment Dataset for Hate and Non-Hate Speech Classification

BengaliSent140: A Large-Scale Bengali Binary Sentiment Dataset for Hate and Non-Hate Speech Classification

URL: http://arxiv.org/abs/2601.20129v1
Date: Tue, 27 Jan 2026 23:47:46 GMT
Title: BengaliSent140: A Large-Scale Bengali Binary Sentiment Dataset for Hate and Non-Hate Speech Classification
Authors: Akif Islam, Sujan Kumar Roy, Md. Ekramul Hamid,
Abstract summary: We introduce BengaliSent140, a large-scale binary sentiment dataset constructed by consolidating seven existing Bengali text datasets into a unified corpus.<n>The resulting dataset comprises 139,792 unique text samples, including 68,548 hate and 71,244 not-hate instances.<n>By integrating data from multiple sources and domains, BengaliSent140 offers broader linguistic and contextual coverage.
Score: 0.1784233255402269
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sentiment analysis for the Bengali language has attracted increasing research interest in recent years. However, progress remains constrained by the scarcity of large-scale and diverse annotated datasets. Although several Bengali sentiment and hate speech datasets are publicly available, most are limited in size or confined to a single domain, such as social media comments. Consequently, these resources are often insufficient for training modern deep learning based models, which require large volumes of heterogeneous data to learn robust and generalizable representations. In this work, we introduce BengaliSent140, a large-scale Bengali binary sentiment dataset constructed by consolidating seven existing Bengali text datasets into a unified corpus. To ensure consistency across sources, heterogeneous annotation schemes are systematically harmonized into a binary sentiment formulation with two classes: Not Hate (0) and Hate (1). The resulting dataset comprises 139,792 unique text samples, including 68,548 hate and 71,244 not-hate instances, yielding a relatively balanced class distribution. By integrating data from multiple sources and domains, BengaliSent140 offers broader linguistic and contextual coverage than existing Bengali sentiment datasets and provides a strong foundation for training and benchmarking deep learning models. Baseline experimental results are also reported to demonstrate the practical usability of the dataset. The dataset is publicly available at https://www.kaggle.com/datasets/akifislam/bengalisent140/

Related papers

BNLI: A Linguistically-Refined Bengali Dataset for Natural Language Inference [1.7688536690159165]
Existing Bengali NLI datasets exhibit several inconsistencies, including annotation errors, ambiguous sentence pairs, and inadequate linguistic diversity.<n>We introduce BNLI, a refined and linguistically curated Bengali NLI dataset designed to support robust language understanding and inference modeling.<n>We benchmarked BNLI using a suite of state-of-the-art transformer-based architectures, including multilingual and Bengali-specific models, to assess their ability to capture complex semantic relations.
arXiv Detail & Related papers (2025-11-11T22:29:14Z)
BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis [0.08246494848934446]
We introduce BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites.<n>We achieve an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks.
arXiv Detail & Related papers (2024-08-16T18:30:22Z)
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language. We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions. We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B. We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively. We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data. We design a simple but effective ensemble-based framework that combines various transfer learning techniques. We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information. Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks. This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
Factorization of Fact-Checks for Low Resource Indian Languages [44.94080515860928]
We introduce FactDRIL: the first large scale multilingual Fact-checking dataset for Regional Indian languages. Our dataset consists of 9,058 samples belonging to English, 5,155 samples to Hindi and remaining 8,222 samples are distributed across various regional languages. We expect this dataset will be a valuable resource and serve as a starting point to fight proliferation of fake news in low resource languages.
arXiv Detail & Related papers (2021-02-23T16:47:41Z)
Hate Speech detection in the Bengali language: A dataset and its baseline evaluation [0.8793721044482612]
This paper presents a new dataset of 30,000 user comments tagged by crowd sourcing and varified by experts. All the comments are collected from YouTube and Facebook comment section and classified into seven categories. A total of 50 annotators annotated each comment three times and the majority vote was taken as the final annotation.
arXiv Detail & Related papers (2020-12-17T15:53:54Z)
Sentiment analysis in Bengali via transfer learning using multi-lingual BERT [0.9883261192383611]
In this paper, we present manually tagged 2-class and 3-class SA datasets in Bengali. We also demonstrate that the multi-lingual BERT model with relevant extensions can be trained via the approach of transfer learning. This deep learning model achieves an accuracy of 71% for 2-class sentiment classification compared to the current state-of-the-art accuracy of 68%.
arXiv Detail & Related papers (2020-12-03T10:21:11Z)
BAN-ABSA: An Aspect-Based Sentiment Analysis dataset for Bengali and it's baseline evaluation [0.8793721044482612]
We present a manually annotated Bengali dataset of high quality, BAN-ABSA, which is annotated with aspect and its associated sentiment by 3 native Bengali speakers. The dataset consists of 2,619 positive, 4,721 negative and 1,669 neutral data samples from 9,009 unique comments gathered from some famous Bengali news portals.
arXiv Detail & Related papers (2020-12-01T06:09:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.