BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
- URL: http://arxiv.org/abs/2408.08964v2
- Date: Sun, 20 Oct 2024 18:59:30 GMT
- Title: BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
- Authors: Sadia Alam, Md Farhan Ishmam, Navid Hasin Alvee, Md Shahnewaz Siddique, Md Azam Hossain, Abu Raihan Mostofa Kamal
- Abstract summary: We introduce BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with $4$ sentiment labels from Facebook, YouTube, and e-commerce sites.
We achieve an overall accuracy of $69.8\%$ and an F1 score of $69.1\%$ on sentiment classification tasks.
- Score: 0.08246494848934446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with $4$ sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose $14$ baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of $69.8\%$ and an F1 score of $69.1\%$ on sentiment classification tasks. Detailed analyses reveal variations in performance across different sentiment labels and text types, highlighting areas for future improvement.
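The two metrics reported in the abstract can be illustrated with a minimal sketch. The abstract does not say how the F1 score is averaged across the four labels, so macro-averaging is an assumption here. The label names and the toy predictions are hypothetical placeholders, not data or results from BnSentMix.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores (assumed averaging scheme)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A label that never appears in either list contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Hypothetical label names standing in for the dataset's 4 sentiment labels.
LABELS = ["label_a", "label_b", "label_c", "label_d"]

# Toy reference labels and predictions, for illustration only.
y_true = ["label_a", "label_b", "label_c", "label_d", "label_a", "label_b"]
y_pred = ["label_a", "label_b", "label_d", "label_d", "label_b", "label_b"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred, LABELS):.3f}")
```

Macro-averaging weights every label equally, so a label the classifier handles poorly lowers the score even when it is rare. A weighted average would instead follow the label frequencies.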
Related papers
- A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z)
- What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora.
WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node.
Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z)
- SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis [26.11758147703999]
SentMix-3L is a novel dataset for sentiment analysis containing code-mixed data between three languages.
We show that GPT-3.5 outperforms all transformer-based models on SentMix-3L.
arXiv Detail & Related papers (2023-10-27T09:59:24Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer based model for word-level language identification in code-mixed Kannada English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text [0.9738927161150494]
The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English.
The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff's alpha.
arXiv Detail & Related papers (2021-06-17T13:13:26Z)
- Sentiment Analysis of Persian-English Code-mixed Texts [0.0]
Due to the unstructured nature of social media data, we are observing more instances of multilingual and code-mixed texts.
In this study we collect, label and thus create a dataset of Persian-English code-mixed tweets.
We introduce a model which uses BERT pretrained embeddings as well as translation models to automatically learn the polarity scores of these Tweets.
arXiv Detail & Related papers (2021-02-25T06:05:59Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- CMSAOne@Dravidian-CodeMix-FIRE2020: A Meta Embedding and Transformer model for Code-Mixed Sentiment Analysis on Social Media Text [9.23545668304066]
Code-mixing (CM) is a frequently observed phenomenon that uses multiple languages in an utterance or sentence.
Sentiment analysis (SA) is a fundamental step in NLP and is well studied for monolingual text.
This paper proposes a meta embedding with a transformer method for sentiment analysis on the Dravidian code-mixed dataset.
arXiv Detail & Related papers (2021-01-22T08:48:27Z)
- A Sentiment Analysis Dataset for Code-Mixed Malayalam-English [0.8454131372606295]
This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators.
We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.
arXiv Detail & Related papers (2020-05-30T07:32:37Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.