A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
- URL: http://arxiv.org/abs/2006.00210v1
- Date: Sat, 30 May 2020 07:32:37 GMT
- Title: A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
- Authors: Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, John P. McCrae
- Abstract summary: This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators.
We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.
- Score: 0.8454131372606295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an increasing demand for sentiment analysis of text from social
media which are mostly code-mixed. Systems trained on monolingual data fail for
code-mixed data due to the complexity of mixing at different levels of the
text. However, very few resources are available for code-mixed data to create
models specific for this data. Although much research in multilingual and
cross-lingual sentiment analysis has used semi-supervised or unsupervised
methods, supervised methods still perform better. Only a few datasets for
popular languages such as English-Spanish, English-Hindi, and English-Chinese
are available. There are no resources available for Malayalam-English
code-mixed data. This paper presents a new gold standard corpus for sentiment
analysis of code-mixed text in Malayalam-English annotated by voluntary
annotators. This gold standard corpus obtained a Krippendorff's alpha above 0.8
for the dataset. We use this new corpus to provide the benchmark for sentiment
analysis in Malayalam-English code-mixed texts.
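
The abstract reports Krippendorff's alpha as its agreement measure but does not say how it was computed. The following is a minimal sketch of the standard alpha for nominal labels, not the authors' code. The function name, the data layout (one list of labels per annotated comment, with `None` for a missing annotation) and the example labels are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists (hypothetical layout): each inner list holds
    the labels that different annotators gave to one item; None = missing.
    """
    # Coincidence matrix: every ordered pair of labels within an item
    # contributes 1 / (m - 1), where m is the number of labels on that item.
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # items with fewer than two labels carry no pairing information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (up to a shared factor of 1 / n).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1.0 - observed / expected


# Illustrative data only: four comments, two annotators each.
example = [
    ["positive", "positive"],
    ["positive", "negative"],
    ["negative", "negative"],
    ["negative", "negative"],
]
alpha = krippendorff_alpha_nominal(example)
print(round(alpha, 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the "above 0.8" reported in the abstract indicates that annotators rarely disagreed relative to chance.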
Related papers
- BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis [0.08246494848934446]
We introduce BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites.
We achieve an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks.
arXiv Detail & Related papers (2024-08-16T18:30:22Z) - A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z) - MaCmS: Magahi Code-mixed Dataset for Sentiment Analysis [1.2568978992326025]
This dataset is the first Magahi-Hindi-English code-mixed dataset for sentiment analysis tasks.
We also provide a linguistics analysis of the dataset to understand the structure of code-mixing.
arXiv Detail & Related papers (2024-03-07T16:29:19Z) - My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models
and Evaluation Benchmarks [0.7874708385247353]
We focus on the low-resource Indian language Marathi which lacks any prior work in code-mixing.
We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining.
We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus.
arXiv Detail & Related papers (2023-06-24T18:17:38Z) - Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z) - Transformer-based Model for Word Level Language Identification in
Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z) - Sentiment Analysis of Persian-English Code-mixed Texts [0.0]
Due to the unstructured nature of social media data, we are observing more instances of multilingual and code-mixed texts.
In this study we collect, label and thus create a dataset of Persian-English code-mixed tweets.
We introduce a model which uses BERT pretrained embeddings as well as translation models to automatically learn the polarity scores of these Tweets.
arXiv Detail & Related papers (2021-02-25T06:05:59Z) - NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of
Code-Mixed Dravidian text using XLNet [0.0]
Social media has penetrated multilingual societies; however, most of their users prefer English as the language of communication.
It is natural for them to mix their native language with English during conversations, resulting in an abundance of multilingual data, called code-mixed data, in today's world.
Downstream NLP tasks using such data are challenging because its semantics are spread across multiple languages.
This paper uses an auto-regressive XLNet model to perform sentiment analysis on code-mixed Tamil-English and Malayalam-English datasets.
arXiv Detail & Related papers (2020-10-15T14:09:02Z) - A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)