BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews
- URL: http://arxiv.org/abs/2305.06595v3
- Date: Thu, 8 Jun 2023 08:57:41 GMT
- Title: BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews
- Authors: Mohsinul Kabir, Obayed Bin Mahfuz, Syed Rifat Raiyan, Hasan Mahmud and Md Kamrul Hasan
- Abstract summary: We present a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral.
We employ a range of machine learning models to establish baselines including SVM, LSTM, and Bangla-BERT.
Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features.
- Score: 1.869097450593631
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The analysis of consumer sentiment, as expressed through reviews, can provide
a wealth of insight regarding the quality of a product. While the study of
sentiment analysis has been widely explored in many popular languages,
relatively less attention has been given to the Bangla language, mostly due to
a lack of relevant data and cross-domain adaptability. To address this
limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews
consisting of 158,065 samples classified into three broad categories: positive,
negative, and neutral. We provide a detailed statistical analysis of the
dataset and employ a range of machine learning models to establish baselines
including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial
performance advantage of pre-trained models over models that rely on manually
crafted features, emphasizing the necessity for additional training resources
in this domain. Additionally, we conduct an in-depth error analysis by
examining sentiment unigrams, which may provide insight into common
classification errors in under-resourced languages like Bangla. Our code and
data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.
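To make the baseline setup concrete, here is a minimal sketch of fine-tuning a pre-trained Bangla BERT checkpoint for the three-class task with Hugging Face transformers. The checkpoint name, file path, column names, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a Bangla-BERT sentiment baseline (not the authors'
# exact pipeline). Assumes a CSV with "text" and "label" columns, with
# labels in {"positive", "negative", "neutral"}.
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = {"negative": 0, "neutral": 1, "positive": 2}
MODEL = "sagorsarker/bangla-bert-base"  # assumed checkpoint name

class ReviewDataset(Dataset):
    def __init__(self, frame, tokenizer):
        self.enc = tokenizer(list(frame["text"]), truncation=True,
                             padding="max_length", max_length=128,
                             return_tensors="pt")
        self.labels = torch.tensor([LABELS[x] for x in frame["label"]])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)
train = ReviewDataset(pd.read_csv("banglabook_train.csv"), tokenizer)  # hypothetical path
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in DataLoader(train, batch_size=16, shuffle=True):
    optimizer.zero_grad()
    loss = model(**batch).loss  # cross-entropy over the 3 classes
    loss.backward()
    optimizer.step()
```

A real run would add multiple epochs, a validation split, and GPU placement; the sketch only shows the shape of the pipeline.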
Related papers
- Enhancing Sentiment Analysis in Bengali Texts: A Hybrid Approach Using Lexicon-Based Algorithm and Pretrained Language Model Bangla-BERT [1.688134675717698]
We develop a novel approach that integrates rule-based algorithms with pre-trained language models.
The rule-based component, a novel algorithm called the Bangla Sentiment Polarity Score (BSPS), generates sentiment scores and classifies reviews into nine distinct sentiment categories.
Our analysis revealed that the BSPS + BanglaBERT hybrid approach outperformed the standalone BanglaBERT model, achieving higher accuracy, higher precision, and more nuanced classification across the nine sentiment categories.
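The summary does not spell out how BSPS computes its scores; purely to illustrate the general lexicon-based idea, a minimal polarity scorer might look like the sketch below. The lexicon entries, negation rule, and cutoffs are invented for the example.

```python
# Illustrative lexicon-based polarity scorer, in the spirit of a
# rule-based method like BSPS. The real BSPS algorithm is not described
# in the summary; lexicon, negation rule, and cutoffs are invented.

LEXICON = {"ভালো": 0.8, "চমৎকার": 1.0, "খারাপ": -0.8, "বাজে": -1.0}
NEGATORS = {"না", "নয়"}  # in Bangla, negators often follow the word

def polarity_score(tokens):
    """Sum lexicon scores, flipping a word's sign if a negator follows it."""
    score = 0.0
    for i, tok in enumerate(tokens):
        s = LEXICON.get(tok, 0.0)
        if i + 1 < len(tokens) and tokens[i + 1] in NEGATORS:
            s = -s
        score += s
    return score

def classify(tokens, pos_cut=0.2, neg_cut=-0.2):
    """Map a raw score onto a coarse sentiment label."""
    s = polarity_score(tokens)
    if s > pos_cut:
        return "positive"
    if s < neg_cut:
        return "negative"
    return "neutral"

print(classify(["বইটি", "চমৎকার"]))      # positive
print(classify(["বইটি", "ভালো", "না"]))  # negative (negated "good")
```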
arXiv Detail & Related papers (2024-11-29T09:57:11Z)
- Learning from Neighbors: Category Extrapolation for Long-Tail Learning [62.30734737735273]
We offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance.
We introduce open-set auxiliary classes that are visually similar to existing ones, aiming to enhance representation learning for both head and tail classes.
To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss.
arXiv Detail & Related papers (2024-10-21T13:06:21Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Paloma: A Benchmark for Evaluating Language Model Fit [112.481957296585]
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training.
We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains.
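For reference, the perplexity such benchmarks report is the exponentiated mean negative log-likelihood per token on held-out text. A minimal sketch, with an arbitrary small causal LM standing in for the model under evaluation:

```python
# Perplexity = exp(mean negative log-likelihood per token). A small
# model is used purely for illustration.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean token NLL.
        loss = lm(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("Held-out text from one evaluation domain."))
```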
arXiv Detail & Related papers (2023-12-16T19:12:45Z)
- Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis [6.471458199049549]
In this study, we present a sizeable manually annotated dataset encompassing 33,606 Bangla news tweets and Facebook comments.
We also investigate zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and Bloomz.
Our findings suggest that monolingual transformer-based models consistently outperform other models, even in zero and few-shot scenarios.
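As a rough illustration of the zero-shot setup, a sentiment prompt might be assembled as below; the template wording is invented here, not the paper's actual prompt.

```python
# Invented zero-shot prompt template for sentiment classification with
# an instruction-tuned LLM; not the cited paper's actual prompt.
def zero_shot_prompt(text: str) -> str:
    return (
        "Classify the sentiment of the following Bangla text as "
        "positive, negative, or neutral. Answer with one word.\n\n"
        f"Text: {text}\nSentiment:"
    )

print(zero_shot_prompt("বইটি খুব ভালো লেগেছে।"))
```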
arXiv Detail & Related papers (2023-08-21T15:19:10Z)
- On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language, and popular NLP models fail to perform well on it.
arXiv Detail & Related papers (2023-04-10T14:27:35Z)
- A Large Scale Search Dataset for Unbiased Learning to Rank [51.97967284268577]
We introduce the Baidu-ULTR dataset for unbiased learning to rank.
It comprises 1.2 billion randomly sampled search sessions and 7,008 expert-annotated queries.
It provides: (1) the original semantic features and a pre-trained language model for easy use; (2) sufficient display information, such as position, displayed height, and displayed abstract; and (3) rich user feedback on search result pages (SERPs), such as dwell time.
arXiv Detail & Related papers (2022-07-07T02:37:25Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Scaling Federated Learning for Fine-tuning of Large Language Models [0.5405981353784006]
Federated learning (FL) is a promising approach to distributed computation over distributed data, and it provides a degree of privacy and compliance with legal frameworks.
In this paper, we explore the fine-tuning of Transformer-based language models in a federated learning setting.
We perform an extensive sweep over the number of clients, up to 32, to evaluate the impact of distributed compute on task performance.
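As a sketch of the federated-averaging step that typically underlies such experiments (a FedAvg-style aggregation, not necessarily the paper's exact protocol):

```python
# FedAvg-style aggregation sketch: average client weights each round.
# Equal weighting assumes similar client dataset sizes; the paper's
# exact protocol may differ.
import copy
import torch

def fedavg(client_states):
    """Element-wise average of client state_dicts (equal weighting)."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        stacked = torch.stack([s[key].float() for s in client_states])
        avg[key] = stacked.mean(dim=0)
    return avg

# Usage: after a round of local fine-tuning on each client,
# global_model.load_state_dict(fedavg([m.state_dict() for m in clients]))
```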
arXiv Detail & Related papers (2021-02-01T14:31:39Z)
- Sentiment Classification in Bangla Textual Content: A Comparative Study [4.2394281761764]
In this study, we explore several publicly available sentiment-labeled datasets and design classifiers using both classical and deep learning algorithms.
Our findings suggest that transformer-based models, which had not previously been explored for Bangla, outperform all other models.
arXiv Detail & Related papers (2020-11-19T21:06:28Z) - ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine
Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.