EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi Emotion Detection
- URL: http://arxiv.org/abs/2405.06922v1
- Date: Sat, 11 May 2024 05:58:55 GMT
- Title: EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi Emotion Detection
- Authors: Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri
- Abstract summary: Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech.
EmoMix-3L is a novel multi-label emotion detection dataset containing code-mixed data from three different languages.
- Score: 24.344204661349327
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce EmoMix-3L, a novel multi-label emotion detection dataset containing code-mixed data from three different languages. We experiment with several models on EmoMix-3L and we report that MuRIL outperforms other models on this dataset.
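The abstract does not spell out the evaluation setup, but multi-label emotion detection models are commonly compared by per-label F1 averaged across emotions. A minimal sketch of that metric (the label set and vectors below are hypothetical, not EmoMix-3L's actual labels):

```python
# Sketch of multi-label emotion evaluation via macro F1 over per-label
# binary decisions. LABELS and the example data are illustrative only;
# EmoMix-3L's real label set and metric are defined in the paper.
LABELS = ["anger", "fear", "joy", "sadness", "surprise"]  # hypothetical

def macro_f1(y_true, y_pred):
    """y_true, y_pred: lists of binary vectors, one slot per label."""
    f1s = []
    for j in range(len(LABELS)):
        tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = [[1, 0, 1, 0, 0], [0, 0, 0, 1, 0]]  # each row: one utterance
pred = [[1, 0, 0, 0, 0], [0, 0, 0, 1, 0]]
score = macro_f1(gold, pred)
```

Each utterance is scored as a binary vector over the emotion inventory, so a single text can carry several emotions at once.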
Related papers
- Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data? [112.0422370149713]
We tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data.
We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers.
We show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources.
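The attack builds on the fact that BPE learns merge rules in descending order of pair frequency, so the merge list encodes frequency statistics of the training mixture. A toy illustration of just that ordering step (corpus and names are made up; the full attack solves for mixture ratios, which this does not do):

```python
from collections import Counter

# Toy illustration of the intuition behind BPE-based mixture inference:
# BPE adopts the most frequent adjacent pair as its next merge rule, so
# early merges reflect whichever data dominates the training mixture.
def first_merge(corpus):
    """Return the most frequent adjacent symbol pair, i.e. the pair
    BPE would merge first when trained on this corpus."""
    pairs = Counter()
    for word in corpus:
        symbols = list(word)
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

corpus = ["the", "then", "she", "he"]  # hypothetical English-heavy sample
merge = first_merge(corpus)            # ('h', 'e') dominates here
```

Comparing observed merge ranks against pair frequencies computed on candidate data sources is what lets the attack recover mixture proportions.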
arXiv Detail & Related papers (2024-07-23T16:13:22Z)
- OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification [26.11758147703999]
Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech.
We introduce OffMix-3L, a novel offensive language identification dataset containing code-mixed data from three different languages.
arXiv Detail & Related papers (2023-10-27T09:59:35Z)
- SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis [26.11758147703999]
SentMix-3L is a novel dataset for sentiment analysis containing code-mixed data from three languages.
We show that GPT-3.5 outperforms all transformer-based models on SentMix-3L.
arXiv Detail & Related papers (2023-10-27T09:59:24Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- MUTANT: A Multi-sentential Code-mixed Hinglish Dataset [16.14337612590717]
We propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles.
As a use case, we leverage multilingual articles and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset.
The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs.
arXiv Detail & Related papers (2023-02-23T04:04:18Z)
- DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training example.
It then uses the perturbed data and the original data to carry out a two-step interpolation in the hidden space of neural models.
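The two-step scheme can be sketched on plain vectors (a rough illustration, not the authors' implementation; in DoubleMix the vectors are model hidden states and the mixing weights are sampled, not fixed as here):

```python
# Sketch of DoubleMix-style two-step interpolation on hidden vectors,
# represented here as plain Python lists. Perturbations and mixing
# weights below are illustrative assumptions.
def mix(a, b, lam):
    """Convex combination lam*a + (1-lam)*b, elementwise."""
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]

def double_mix(hidden, perturbed, lam1=0.5, lam2=0.7):
    # Step 1: interpolate among the perturbed versions of the sample.
    mixed = perturbed[0]
    for p in perturbed[1:]:
        mixed = mix(mixed, p, lam1)
    # Step 2: interpolate the result back with the original hidden
    # state, keeping the original dominant (lam2 > 0.5).
    return mix(hidden, mixed, lam2)

h = [1.0, 0.0]                          # original hidden vector
perturbed = [[1.1, 0.1], [0.9, -0.1]]   # e.g. token-level perturbations
out = double_mix(h, perturbed)
```

Because the second step keeps the original sample dominant, the augmented point stays close to the original in hidden space while still absorbing the perturbations.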
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
- CoSMix: Compositional Semantic Mix for Domain Adaptation in 3D LiDAR Segmentation [62.259239847977014]
We propose a new approach to sample mixing for point cloud unsupervised domain adaptation (UDA), namely Compositional Semantic Mix (CoSMix).
CoSMix consists of a two-branch symmetric network that can process labelled synthetic data (source) and real-world unlabelled point clouds (target) concurrently.
We evaluate CoSMix on two large-scale datasets, showing that it outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-07-20T09:33:42Z)
- PreCogIIITH at HinglishEval: Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality [18.806186479627335]
In our submission to HinglishEval, a shared task collocated with INLG 2022, we attempt to build models that predict quality ratings for synthetically generated code-mixed text.
arXiv Detail & Related papers (2022-06-16T08:00:42Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
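The self-teaching term can be sketched as a KL divergence pulling the model's distribution on the translated text toward the soft pseudo-labels (the distributions below are hand-made for illustration; in FILTER both come from the model itself):

```python
import math

# Illustrative sketch of a KL-divergence self-teaching loss. The
# teacher's soft pseudo-label and the model prediction are assumed
# values, not outputs of the actual FILTER model.
def kl_div(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

soft_pseudo = [0.7, 0.2, 0.1]   # auto-generated soft pseudo-label
student_pred = [0.6, 0.3, 0.1]  # model output on translated text
loss = kl_div(soft_pseudo, student_pred)  # > 0; zero iff identical
```

Minimizing this term pushes the prediction on the translation to agree with the pseudo-label, which is what lets the model learn from target-language text without gold labels.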
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- A Sentiment Analysis Dataset for Code-Mixed Malayalam-English [0.8454131372606295]
This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators.
We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.
arXiv Detail & Related papers (2020-05-30T07:32:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.