ANUBHUTI: A Comprehensive Corpus For Sentiment Analysis In Bangla Regional Languages
- URL: http://arxiv.org/abs/2506.21686v1
- Date: Thu, 26 Jun 2025 18:13:54 GMT
- Title: ANUBHUTI: A Comprehensive Corpus For Sentiment Analysis In Bangla Regional Languages
- Authors: Swastika Kundu, Autoshi Ibrahim, Mithila Rahman, Tanvir Ahmed,
- Abstract summary: ANUBHUTI fills a critical gap in resources for sentiment analysis in low-resource Bangla dialects. The dataset features political and religious content, reflecting the contemporary sociopolitical landscape of Bangladesh. The dataset was further refined through systematic checks for missing data, anomalies, and inconsistencies.
- Score: 0.5062312533373298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sentiment analysis for regional dialects of Bangla remains an underexplored area due to linguistic diversity and limited annotated data. This paper introduces ANUBHUTI, a comprehensive dataset consisting of 2000 sentences manually translated from standard Bangla into four major regional dialects: Mymensingh, Noakhali, Sylhet, and Chittagong. The dataset predominantly features political and religious content, reflecting the contemporary sociopolitical landscape of Bangladesh, alongside neutral texts to maintain balance. Each sentence is annotated using a dual annotation scheme: multiclass thematic labeling categorizes sentences as Political, Religious, or Neutral, and multilabel emotion annotation assigns one or more emotions from Anger, Contempt, Disgust, Enjoyment, Fear, Sadness, and Surprise. Expert native translators conducted the translation and annotation, with quality assurance performed via Cohen's Kappa inter-annotator agreement, achieving strong consistency across dialects. The dataset was further refined through systematic checks for missing data, anomalies, and inconsistencies. ANUBHUTI fills a critical gap in resources for sentiment analysis in low-resource Bangla dialects, enabling more accurate and context-aware natural language processing.
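The dual annotation scheme and the agreement check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the class name `AnnotatedSentence`, its field names, and the function `cohen_kappa` are hypothetical, and applying Cohen's Kappa to the thematic labels of two annotators is an assumption, since the abstract does not say how agreement was computed for the multilabel emotions.

```python
from collections import Counter
from dataclasses import dataclass

# Label sets taken from the abstract.
THEMES = ("Political", "Religious", "Neutral")
EMOTIONS = ("Anger", "Contempt", "Disgust", "Enjoyment",
            "Fear", "Sadness", "Surprise")


@dataclass(frozen=True)
class AnnotatedSentence:
    """Hypothetical record: one theme (multiclass), one or more emotions (multilabel)."""
    text: str
    dialect: str
    theme: str
    emotions: frozenset

    def __post_init__(self):
        if self.theme not in THEMES:
            raise ValueError(f"unknown theme: {self.theme}")
        if not self.emotions or not set(self.emotions) <= set(EMOTIONS):
            raise ValueError("emotions must be a non-empty subset of EMOTIONS")

    def emotion_vector(self):
        """Multi-hot encoding of the emotion labels, in EMOTIONS order."""
        return [int(e in self.emotions) for e in EMOTIONS]


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Placeholder text; the real sentences are in the dataset itself.
    sample = AnnotatedSentence(
        text="<dialect sentence>",
        dialect="Sylhet",
        theme="Political",
        emotions=frozenset({"Anger", "Fear"}),
    )
    print(sample.emotion_vector())

    annotator_1 = ["Political", "Political", "Religious", "Neutral"]
    annotator_2 = ["Political", "Religious", "Religious", "Neutral"]
    print(round(cohen_kappa(annotator_1, annotator_2), 4))
```

Keeping the theme as a single string and the emotions as a set mirrors the multiclass versus multilabel distinction in the abstract. The multi-hot vector is one common way to feed multilabel targets to a classifier.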
Related papers
- BIDWESH: A Bangla Regional Based Hate Speech Detection Dataset [0.0]
This study introduces BIDWESH, the first multi-dialectal Bangla hate speech dataset. It was constructed by translating and annotating 9,183 instances from the BD-SHS corpus into three major regional dialects. The resulting dataset provides a linguistically rich, balanced, and inclusive resource for advancing hate speech detection in Bangla.
arXiv Detail & Related papers (2025-07-22T02:53:48Z)
- Towards Explainable Bilingual Multimodal Misinformation Detection and Localization [64.37162720126194]
BiMi is a framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. BiMiBench is a benchmark constructed by systematically editing real news images and subtitles. BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore.
arXiv Detail & Related papers (2025-06-28T15:43:06Z)
- BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla [0.0]
We introduce BanTH, the first multi-label transliterated Bangla hate speech dataset comprising 37.3k samples. The samples are sourced from YouTube comments, where each instance is labeled with one or more target groups. Experiments reveal that our further pre-trained encoders are achieving state-of-the-art performance on the BanTH dataset.
arXiv Detail & Related papers (2024-10-17T07:15:15Z)
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- Content-Localization based System for Analyzing Sentiment and Hate Behaviors in Low-Resource Dialectal Arabic: English to Levantine and Gulf [5.2957928879391]
This paper proposes to localize content of resources in high-resourced languages into under-resourced Arabic dialects.
We utilize content-localization based neural machine translation to develop sentiment and hate classifiers for two low-resourced Arabic dialects: Levantine and Gulf.
Our findings shed light on the importance of considering the unique nature of dialects within the same language, and show that ignoring the dialectal aspect would lead to misleading analysis.
arXiv Detail & Related papers (2023-11-27T15:37:33Z)
- BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla Lemmatizer [3.1742013359102175]
We propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer for Bangla.
Our system aims to lemmatize words based on their parts of speech class within a given sentence.
The lemmatizer achieves an accuracy of 96.36% when tested against a manually annotated test dataset by trained.
arXiv Detail & Related papers (2023-11-06T13:02:07Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- SentiGOLD: A Large Bangla Gold Standard Multi-Domain Sentiment Analysis Dataset and its Evaluation [0.9894420655516565]
SentiGOLD adheres to established linguistic conventions agreed upon by the Government of Bangladesh and a Bangla linguistics committee.
The dataset incorporates data from online video comments, social media posts, blogs, news, and other sources while maintaining domain and class distribution rigorously.
The top model achieves a macro F1 score of 0.62 (intra-dataset) across 5 classes, setting a benchmark, and 0.61 (cross-dataset from SentNoB) across 3 classes, comparable to the state-of-the-art.
arXiv Detail & Related papers (2023-06-09T12:07:10Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Anubhuti -- An annotated dataset for emotional analysis of Bengali short stories [2.3424047967193826]
Anubhuti is the first and largest text corpus for analyzing emotions expressed by writers of Bengali short stories.
We explain the data collection methods, the manual annotation process and the resulting high inter-annotator agreement.
We have verified the performance of our dataset with baseline machine learning models and a deep learning model for emotion classification.
arXiv Detail & Related papers (2020-10-06T22:33:58Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.