SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment
Analysis
- URL: http://arxiv.org/abs/2310.18023v2
- Date: Wed, 29 Nov 2023 10:33:26 GMT
- Title: SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment
Analysis
- Authors: Md Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios
Anastasopoulos, Marcos Zampieri
- Abstract summary: SentMix-3L is a novel dataset for sentiment analysis containing code-mixed data between three languages.
We show that GPT-3.5 outperforms all transformer-based models on SentMix-3L.
- Score: 26.11758147703999
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Code-mixing is a well-studied linguistic phenomenon that occurs when two or more
languages are mixed in text or speech. Several datasets have been built with
the goal of training computational models for code-mixing. Although it is very
common to observe code-mixing with multiple languages, most available datasets
contain code-mixed data between only two languages. In this paper, we introduce
SentMix-3L, a novel dataset for sentiment analysis containing code-mixed data
between three languages: Bangla, English, and Hindi. We carry out a
comprehensive evaluation using SentMix-3L. We show that zero-shot prompting
with GPT-3.5 outperforms all transformer-based models on SentMix-3L.
Related papers
- Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data? [112.0422370149713]
We tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data.
We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers.
We show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources.
arXiv Detail & Related papers (2024-07-23T16:13:22Z)
- EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi Emotion Detection [24.344204661349327]
Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech.
EmoMix-3L is a novel multi-label emotion detection dataset containing code-mixed data from three different languages.
arXiv Detail & Related papers (2024-05-11T05:58:55Z)
- A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z)
- OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification [26.11758147703999]
Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech.
We introduce OffMix-3L, a novel offensive language identification dataset containing code-mixed data from three different languages.
arXiv Detail & Related papers (2023-10-27T09:59:35Z)
- My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks [0.7874708385247353]
We focus on the low-resource Indian language Marathi, which lacks any prior work in code-mixing.
We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining.
We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus.
arXiv Detail & Related papers (2023-06-24T18:17:38Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- CoSMix: Compositional Semantic Mix for Domain Adaptation in 3D LiDAR Segmentation [62.259239847977014]
We propose a new sample-mixing approach for point cloud UDA, namely Compositional Semantic Mix (CoSMix).
CoSMix consists of a two-branch symmetric network that can process labelled synthetic data (source) and real-world unlabelled point clouds (target) concurrently.
We evaluate CoSMix on two large-scale datasets, showing that it outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-07-20T09:33:42Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- IIITT@Dravidian-CodeMix-FIRE2021: Transliterate or translate? Sentiment analysis of code-mixed text in Dravidian languages [0.0]
This paper makes a small contribution to this line of research in the form of sentiment analysis of code-mixed social media comments in the popular Dravidian languages Kannada, Tamil and Malayalam.
It describes the work for the shared task conducted by Dravidian-CodeMix at FIRE 2021, which employed pre-trained models such as ULMFiT and multilingual BERT fine-tuned on the code-mixed dataset.
The paper reports the results: the best models ranked 4th, 5th and 10th in the Tamil, Kannada and Malayalam tasks, respectively.
arXiv Detail & Related papers (2021-11-15T16:57:59Z)
- A Sentiment Analysis Dataset for Code-Mixed Malayalam-English [0.8454131372606295]
This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators.
We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.
arXiv Detail & Related papers (2020-05-30T07:32:37Z)
- MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification [68.15015032551214]
MixText is a semi-supervised learning method for text classification.
TMix creates a large amount of augmented training samples by interpolating text in hidden space.
We leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data.
arXiv Detail & Related papers (2020-04-25T21:37:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.