BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews
- URL: http://arxiv.org/abs/2305.06595v3
- Date: Thu, 8 Jun 2023 08:57:41 GMT
- Title: BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews
- Authors: Mohsinul Kabir, Obayed Bin Mahfuz, Syed Rifat Raiyan, Hasan Mahmud and Md Kamrul Hasan
- Abstract summary: We present a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral.
We employ a range of machine learning models to establish baselines including SVM, LSTM, and Bangla-BERT.
Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features.
- Score: 1.869097450593631
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The analysis of consumer sentiment, as expressed through reviews, can provide
a wealth of insight regarding the quality of a product. While the study of
sentiment analysis has been widely explored in many popular languages,
relatively less attention has been given to the Bangla language, mostly due to
a lack of relevant data and cross-domain adaptability. To address this
limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews
consisting of 158,065 samples classified into three broad categories: positive,
negative, and neutral. We provide a detailed statistical analysis of the
dataset and employ a range of machine learning models to establish baselines
including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial
performance advantage of pre-trained models over models that rely on manually
crafted features, emphasizing the necessity for additional training resources
in this domain. Additionally, we conduct an in-depth error analysis by
examining sentiment unigrams, which may provide insight into common
classification errors in under-resourced languages like Bangla. Our code and
data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.
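To make the baseline setup concrete, here is a minimal sketch of fine-tuning a pre-trained Bangla BERT checkpoint for the three-class task with Hugging Face transformers. The checkpoint name, file path, column names, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a Bangla-BERT sentiment baseline (not the authors'
# exact pipeline). Assumes a CSV with "text" and "label" columns, with
# labels in {"positive", "negative", "neutral"}.
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = {"negative": 0, "neutral": 1, "positive": 2}
MODEL = "sagorsarker/bangla-bert-base"  # assumed checkpoint name

class ReviewDataset(Dataset):
    def __init__(self, frame, tokenizer):
        self.enc = tokenizer(list(frame["text"]), truncation=True,
                             padding="max_length", max_length=128,
                             return_tensors="pt")
        self.labels = torch.tensor([LABELS[x] for x in frame["label"]])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)
train = ReviewDataset(pd.read_csv("banglabook_train.csv"), tokenizer)  # hypothetical path
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in DataLoader(train, batch_size=16, shuffle=True):
    optimizer.zero_grad()
    loss = model(**batch).loss  # cross-entropy over the 3 classes
    loss.backward()
    optimizer.step()
```

A real run would add multiple epochs, a validation split, and GPU placement; the sketch only shows the shape of the pipeline.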
Related papers
- Enhancing Sentiment Analysis in Bengali Texts: A Hybrid Approach Using Lexicon-Based Algorithm and Pretrained Language Model Bangla-BERT [1.688134675717698]
We develop a novel approach that integrates rule-based algorithms with pre-trained language models.
The rule-based component, a novel algorithm called the Bangla Sentiment Polarity Score (BSPS), generates sentiment scores and classifies reviews into nine distinct sentiment categories.
Our analysis revealed that the BSPS + BanglaBERT hybrid approach outperformed the standalone BanglaBERT model, achieving higher accuracy, higher precision, and more nuanced classification across the nine sentiment categories.
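The summary does not spell out how BSPS computes its scores; purely to illustrate the general lexicon-based idea, a minimal polarity scorer might look like the sketch below. The lexicon entries, negation rule, and cutoffs are invented for the example.

```python
# Illustrative lexicon-based polarity scorer, in the spirit of a
# rule-based method like BSPS. The real BSPS algorithm is not described
# in the summary; lexicon, negation rule, and cutoffs are invented.

LEXICON = {"ভালো": 0.8, "চমৎকার": 1.0, "খারাপ": -0.8, "বাজে": -1.0}
NEGATORS = {"না", "নয়"}  # in Bangla, negators often follow the word

def polarity_score(tokens):
    """Sum lexicon scores, flipping a word's sign if a negator follows it."""
    score = 0.0
    for i, tok in enumerate(tokens):
        s = LEXICON.get(tok, 0.0)
        if i + 1 < len(tokens) and tokens[i + 1] in NEGATORS:
            s = -s
        score += s
    return score

def classify(tokens, pos_cut=0.2, neg_cut=-0.2):
    """Map a raw score onto a coarse sentiment label."""
    s = polarity_score(tokens)
    if s > pos_cut:
        return "positive"
    if s < neg_cut:
        return "negative"
    return "neutral"

print(classify(["বইটি", "চমৎকার"]))      # positive
print(classify(["বইটি", "ভালো", "না"]))  # negative (negated "good")
```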
arXiv Detail & Related papers (2024-11-29T09:57:11Z)
- Learning from Neighbors: Category Extrapolation for Long-Tail Learning [62.30734737735273]
We offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance.
We introduce open-set auxiliary classes that are visually similar to existing ones, aiming to enhance representation learning for both head and tail classes.
To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss.
arXiv Detail & Related papers (2024-10-21T13:06:21Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Paloma: A Benchmark for Evaluating Language Model Fit [112.481957296585]
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training.
We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains.
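For reference, the perplexity such benchmarks report is the exponentiated mean negative log-likelihood per token on held-out text. A minimal sketch, with an arbitrary small causal LM standing in for the model under evaluation:

```python
# Perplexity = exp(mean negative log-likelihood per token). A small
# model is used purely for illustration.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean token NLL.
        loss = lm(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("Held-out text from one evaluation domain."))
```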
arXiv Detail & Related papers (2023-12-16T19:12:45Z)
- Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis [6.471458199049549]
In this study, we present a sizeable manually annotated dataset encompassing 33,606 Bangla news tweets and Facebook comments.
We also investigate zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and Bloomz.
Our findings suggest that monolingual transformer-based models consistently outperform other models, even in zero and few-shot scenarios.
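As a rough illustration of the zero-shot setup, a sentiment prompt might be assembled as below; the template wording is invented here, not the paper's actual prompt.

```python
# Invented zero-shot prompt template for sentiment classification with
# an instruction-tuned LLM; not the cited paper's actual prompt.
def zero_shot_prompt(text: str) -> str:
    return (
        "Classify the sentiment of the following Bangla text as "
        "positive, negative, or neutral. Answer with one word.\n\n"
        f"Text: {text}\nSentiment:"
    )

print(zero_shot_prompt("বইটি খুব ভালো লেগেছে।"))
```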
arXiv Detail & Related papers (2023-08-21T15:19:10Z)
- On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language, and popular NLP models fail to perform well on it.
arXiv Detail & Related papers (2023-04-10T14:27:35Z)
- A Large Scale Search Dataset for Unbiased Learning to Rank [51.97967284268577]
We introduce the Baidu-ULTR dataset for unbiased learning to rank.
It comprises 1.2 billion randomly sampled search sessions and 7,008 expert-annotated queries.
It provides: (1) the original semantic features and a pre-trained language model for easy use; (2) sufficient display information, such as position, displayed height, and displayed abstract; and (3) rich user feedback on search result pages (SERPs), such as dwell time.
arXiv Detail & Related papers (2022-07-07T02:37:25Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Scaling Federated Learning for Fine-tuning of Large Language Models [0.5405981353784006]
Federated learning (FL) is a promising approach to distributed computation over distributed data, and it provides a degree of privacy and compliance with legal frameworks.
In this paper, we explore the fine-tuning of Transformer-based language models in a federated learning setting.
We perform an extensive sweep over the number of clients, up to 32, to evaluate the impact of distributed compute on task performance.
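As a sketch of the federated-averaging step that typically underlies such experiments (a FedAvg-style aggregation, not necessarily the paper's exact protocol):

```python
# FedAvg-style aggregation sketch: average client weights each round.
# Equal weighting assumes similar client dataset sizes; the paper's
# exact protocol may differ.
import copy
import torch

def fedavg(client_states):
    """Element-wise average of client state_dicts (equal weighting)."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        stacked = torch.stack([s[key].float() for s in client_states])
        avg[key] = stacked.mean(dim=0)
    return avg

# Usage: after a round of local fine-tuning on each client,
# global_model.load_state_dict(fedavg([m.state_dict() for m in clients]))
```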
arXiv Detail & Related papers (2021-02-01T14:31:39Z)
- Sentiment Classification in Bangla Textual Content: A Comparative Study [4.2394281761764]
In this study, we explore several publicly available sentiment-labeled datasets and design classifiers using both classical and deep learning algorithms.
Our findings suggest that transformer-based models, which had not previously been explored for Bangla, outperform all other models.
arXiv Detail & Related papers (2020-11-19T21:06:28Z) - ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine
Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.