Sentiment analysis in Bengali via transfer learning using multi-lingual
BERT
- URL: http://arxiv.org/abs/2012.07538v1
- Date: Thu, 3 Dec 2020 10:21:11 GMT
- Authors: Khondoker Ittehadul Islam, Md. Saiful Islam and Md Ruhul Amin
- Abstract summary: In this paper, we present manually tagged 2-class and 3-class SA datasets in Bengali.
We also demonstrate that the multi-lingual BERT model with relevant extensions can be trained via the approach of transfer learning.
This deep learning model achieves an accuracy of 71% for 2-class sentiment classification compared to the current state-of-the-art accuracy of 68%.
- Score: 0.9883261192383611
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sentiment analysis (SA) in Bengali is challenging because this Indo-Aryan
language is highly inflected, with more than 160 inflected forms for verbs,
36 for nouns, and 24 for pronouns. The lack of standard labeled datasets in the Bengali domain makes the
task of SA even harder. In this paper, we present manually tagged 2-class and
3-class SA datasets in Bengali. We also demonstrate that the multi-lingual BERT
model with relevant extensions can be trained via the approach of transfer
learning over those novel datasets to improve the state-of-the-art performance
in sentiment classification tasks. This deep learning model achieves an
accuracy of 71% for 2-class sentiment classification compared to the current
state-of-the-art accuracy of 68%. We also present the very first Bengali SA
classifier for the 3-class manually tagged dataset, and our proposed model
achieves an accuracy of 60%. We further use this model to analyze the
sentiment of public comments in an online daily newspaper. Our analysis shows
that people post negative comments on political or sports news more often,
while comments on religious articles express positive sentiment. The dataset
and code are publicly available at
https://github.com/KhondokerIslam/Bengali_Sentiment.
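The transfer-learning setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' model: the abstract does not say what the "relevant extensions" are, so the sketch uses a plain classification head on top of the public `bert-base-multilingual-cased` checkpoint via the Hugging Face `transformers` library. The label names, the helper functions, and the `freeze_encoder` option are all assumptions made for illustration.

```python
from typing import List, Sequence

# Assumption: the public multilingual BERT checkpoint on the Hugging Face Hub.
MODEL_NAME = "bert-base-multilingual-cased"

# Hypothetical label sets for the 2-class and 3-class tasks.
LABELS_2 = ["negative", "positive"]
LABELS_3 = ["negative", "neutral", "positive"]


def encode_labels(labels: Sequence[str], classes: Sequence[str]) -> List[int]:
    """Map string labels to integer ids in the order given by `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[label] for label in labels]


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(gold) == 0 or len(predicted) != len(gold):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def build_classifier(num_labels: int, freeze_encoder: bool = False):
    """Load pretrained mBERT with a freshly initialised classification head.

    Imports are local so the helpers above work without the heavy dependencies.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=num_labels
    )
    if freeze_encoder:
        # Keep the pretrained encoder fixed and train only the new head.
        for param in model.base_model.parameters():
            param.requires_grad = False
    return tokenizer, model


def train_step(tokenizer, model, optimizer, texts, label_ids) -> float:
    """Run one gradient update on a small batch and return the loss value."""
    import torch

    model.train()
    batch = tokenizer(list(texts), padding=True, truncation=True,
                      return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(list(label_ids)))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

To try it, one would call `build_classifier(num_labels=2)` (or `3` for the 3-class task), create an optimizer such as `torch.optim.AdamW(model.parameters(), lr=2e-5)`, and pass batches of comment texts with their encoded labels to `train_step`. The learning rate here is a common fine-tuning default, not a value reported in the paper.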
Related papers
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Bengali Handwritten Grapheme Classification: Deep Learning Approach [0.0]
We participate in a Kaggle competition where the challenge is to classify three constituent elements of a Bengali grapheme in the image.
We explore the performance of some existing neural network models such as the Multi-Layer Perceptron (MLP) and the state-of-the-art ResNet50.
We propose our own convolutional neural network (CNN) model for Bengali grapheme classification with validation root accuracy 95.32%, vowel accuracy 98.61%, and consonant accuracy 98.76%.
arXiv Detail & Related papers (2021-11-16T06:14:59Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Simple or Complex? Learning to Predict Readability of Bengali Texts [6.860272388539321]
We present a readability analysis tool capable of analyzing text written in the Bengali language.
Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing.
arXiv Detail & Related papers (2020-12-09T01:41:35Z)
- BAN-ABSA: An Aspect-Based Sentiment Analysis dataset for Bengali and it's baseline evaluation [0.8793721044482612]
We present a manually annotated Bengali dataset of high quality, BAN-ABSA, which is annotated with aspect and its associated sentiment by 3 native Bengali speakers.
The dataset consists of 2,619 positive, 4,721 negative and 1,669 neutral data samples from 9,009 unique comments gathered from some famous Bengali news portals.
arXiv Detail & Related papers (2020-12-01T06:09:44Z)
- Anubhuti -- An annotated dataset for emotional analysis of Bengali short stories [2.3424047967193826]
Anubhuti is the first and largest text corpus for analyzing emotions expressed by writers of Bengali short stories.
We explain the data collection methods, the manual annotation process and the resulting high inter-annotator agreement.
We have verified the performance of our dataset with a baseline machine learning model and a deep learning model for emotion classification.
arXiv Detail & Related papers (2020-10-06T22:33:58Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network [3.0168410626760034]
We build the largest Bengali word embedding models to date based on 250 million articles, which we call BengFastText.
We incorporate word embeddings into a Multichannel Convolutional-LSTM network for predicting different types of hate speech, document classification, and sentiment analysis.
arXiv Detail & Related papers (2020-04-11T22:17:04Z)