MaCmS: Magahi Code-mixed Dataset for Sentiment Analysis
- URL: http://arxiv.org/abs/2403.04639v2
- Date: Fri, 22 Mar 2024 17:28:42 GMT
- Title: MaCmS: Magahi Code-mixed Dataset for Sentiment Analysis
- Authors: Priya Rani, Gaurav Negi, Theodorus Fransen, John P. McCrae
- Abstract summary: This dataset is the first Magahi-Hindi-English code-mixed dataset for sentiment analysis tasks.
We also provide a linguistic analysis of the dataset to understand the structure of code-mixing.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces MaCmS, a new sentiment dataset for Magahi-Hindi-English (MHE) code-mixed language, where Magahi is a less-resourced minority language. This dataset is the first Magahi-Hindi-English code-mixed dataset for sentiment analysis tasks. We also provide a linguistic analysis of the dataset to understand the structure of code-mixing, and a statistical study to understand the language preferences of speakers with different polarities. Alongside these analyses, we train baseline models to evaluate the dataset's quality.
Related papers
- BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla [0.0]
This study presents BanStereoSet, a dataset designed to evaluate stereotypical social biases in multilingual LLMs for the Bangla language.
Our dataset consists of 1,194 sentences spanning 9 categories of bias: race, profession, gender, ageism, beauty, beauty in profession, region, caste, and religion.
arXiv Detail & Related papers (2024-09-18T02:02:30Z)
- Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification [26.11758147703999]
Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech.
We introduce OffMix-3L, a novel offensive language identification dataset containing code-mixed data from three different languages.
arXiv Detail & Related papers (2023-10-27T09:59:35Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
On the CoLI-Kenglish dataset, the proposed model achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- Sentiment Analysis of Persian-English Code-mixed Texts [0.0]
Due to the unstructured nature of social media data, we are observing more instances of multilingual and code-mixed texts.
In this study we collect, label and thus create a dataset of Persian-English code-mixed tweets.
We introduce a model which uses BERT pretrained embeddings as well as translation models to automatically learn the polarity scores of these Tweets.
arXiv Detail & Related papers (2021-02-25T06:05:59Z)
- NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Code-Mixed Dravidian text using XLNet [0.0]
Social media has penetrated multilingual societies; however, most of their members use English as the preferred language for communication.
It is natural for them to mix their cultural language with English during conversations, resulting in an abundance of multilingual data, called code-mixed data, available in today's world.
Downstream NLP tasks using such data are challenging because the semantics are spread across multiple languages.
This paper uses an auto-regressive XLNet model to perform sentiment analysis on code-mixed Tamil-English and Malayalam-English datasets.
arXiv Detail & Related papers (2020-10-15T14:09:02Z)
- A Sentiment Analysis Dataset for Code-Mixed Malayalam-English [0.8454131372606295]
This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators.
We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.
arXiv Detail & Related papers (2020-05-30T07:32:37Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.