BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from
Book Reviews
- URL: http://arxiv.org/abs/2305.06595v3
- Date: Thu, 8 Jun 2023 08:57:41 GMT
- Authors: Mohsinul Kabir, Obayed Bin Mahfuz, Syed Rifat Raiyan, Hasan Mahmud and
Md Kamrul Hasan
- Abstract summary: We present a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral.
We employ a range of machine learning models to establish baselines including SVM, LSTM, and Bangla-BERT.
Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The analysis of consumer sentiment, as expressed through reviews, can provide
a wealth of insight regarding the quality of a product. While the study of
sentiment analysis has been widely explored in many popular languages,
relatively less attention has been given to the Bangla language, mostly due to
a lack of relevant data and cross-domain adaptability. To address this
limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews
consisting of 158,065 samples classified into three broad categories: positive,
negative, and neutral. We provide a detailed statistical analysis of the
dataset and employ a range of machine learning models to establish baselines
including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial
performance advantage of pre-trained models over models that rely on manually
crafted features, emphasizing the necessity for additional training resources
in this domain. Additionally, we conduct an in-depth error analysis by
examining sentiment unigrams, which may provide insight into common
classification errors in under-resourced languages like Bangla. Our code and
data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.
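The unigram-based error analysis mentioned above can be illustrated with a minimal stdlib-only sketch. The romanized toy reviews and the helper name `sentiment_unigrams` below are hypothetical, not drawn from BanglaBook:

```python
from collections import Counter

# Hypothetical romanized toy reviews; the real BanglaBook labels are
# positive / negative / neutral over 158,065 Bangla samples.
reviews = [
    ("boi ta khub bhalo", "positive"),
    ("darun lekha, bhalo laglo", "positive"),
    ("ekdom bhalo na, baje boi", "negative"),
    ("baje onubad, somoy nosto", "negative"),
    ("moTamuTi, kharap na abar bhalo-o na", "neutral"),
]

def sentiment_unigrams(samples):
    """Count unigram frequencies separately for each sentiment class."""
    counts = {}
    for text, label in samples:
        counts.setdefault(label, Counter()).update(text.split())
    return counts

counts = sentiment_unigrams(reviews)

# Unigrams shared between opposite classes (e.g. "bhalo" ("good") under
# negation in a negative review) are candidates for explaining systematic
# misclassifications by feature-based models.
ambiguous = {w for w in counts["positive"] if w in counts["negative"]}
```

Words such as "bhalo" that appear in both positive and negative reviews are exactly the unigrams that confuse classifiers built on manually crafted features, which is consistent with the performance gap the paper reports against pre-trained models.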
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough?
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis
In this study, we present a sizeable manually annotated dataset encompassing 33,606 Bangla news tweets and Facebook comments.
We also investigate zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and Bloomz.
Our findings suggest that monolingual transformer-based models consistently outperform other models, even in zero- and few-shot scenarios.
arXiv Detail & Related papers (2023-08-21T15:19:10Z)
- On Evaluation of Bangla Word Analogies
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the seventh most-spoken language in the world, Bangla remains a low-resource language on which popular NLP models fail to perform well.
arXiv Detail & Related papers (2023-04-10T14:27:35Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- A Large Scale Search Dataset for Unbiased Learning to Rank
We introduce the Baidu-ULTR dataset for unbiased learning to rank.
It comprises 1.2 billion randomly sampled search sessions and 7,008 expert-annotated queries.
It provides: (1) the original semantic feature and a pre-trained language model for easy usage; (2) sufficient display information such as position, displayed height, and displayed abstract; and (3) rich user feedback on search result pages (SERPs) like dwelling time.
arXiv Detail & Related papers (2022-07-07T02:37:25Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models for identifying the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Scaling Federated Learning for Fine-tuning of Large Language Models
Federated learning (FL) is a promising approach to distributed computation over distributed data, and provides a degree of privacy and compliance with legal frameworks.
In this paper, we explore the fine-tuning of Transformer-based language models in a federated learning setting.
We perform an extensive sweep over the number of clients, ranging up to 32, to evaluate the impact of distributed compute on task performance.
arXiv Detail & Related papers (2021-02-01T14:31:39Z)
- BAN-ABSA: An Aspect-Based Sentiment Analysis dataset for Bengali and it's baseline evaluation
We present BAN-ABSA, a high-quality, manually annotated Bengali dataset in which aspects and their associated sentiment were labeled by three native Bengali speakers.
The dataset consists of 2,619 positive, 4,721 negative, and 1,669 neutral samples drawn from 9,009 unique comments gathered from popular Bengali news portals.
arXiv Detail & Related papers (2020-12-01T06:09:44Z)
- Sentiment Classification in Bangla Textual Content: A Comparative Study
In this study, we explore several publicly available sentiment-labeled datasets and design classifiers using both classical and deep learning algorithms.
Our findings suggest that transformer-based models, which had not previously been explored for Bangla, outperform all other models.
arXiv Detail & Related papers (2020-11-19T21:06:28Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.