OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification
- URL: http://arxiv.org/abs/2310.18387v2
- Date: Sat, 25 Nov 2023 13:13:01 GMT
- Title: OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification
- Authors: Dhiman Goswami, Md Nishat Raihan, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri
- Abstract summary: Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech.
We introduce OffMix-3L, a novel offensive language identification dataset containing code-mixed data from three different languages.
- Score: 26.11758147703999
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several works have been conducted on
building datasets and performing downstream NLP tasks on code-mixed data.
Although it is not uncommon to observe code-mixing of three or more languages,
most available datasets in this domain contain code-mixed data from only two
languages. In this paper, we introduce OffMix-3L, a novel offensive language
identification dataset containing code-mixed data from three different
languages. We experiment with several models on this dataset and observe that
BanglishBERT outperforms other transformer-based models and GPT-3.5.
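To make the experimental setup concrete, below is a minimal fine-tuning sketch for a BanglishBERT-style classifier using the HuggingFace transformers library; the checkpoint id csebuetnlp/banglishbert, the toy examples, and the hyperparameters are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch: fine-tuning a BanglishBERT-style encoder for binary
# offensive-language identification on code-mixed text. The checkpoint id,
# toy examples, and hyperparameters are assumptions for illustration.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "csebuetnlp/banglishbert"  # assumed Hub id for BanglishBERT

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Placeholder code-mixed examples; OffMix-3L itself would be loaded here.
texts = ["khub bhalo kaam, nice work yaar", "tumi ekdom bekar ho"]
labels = [0, 1]  # 0 = not offensive, 1 = offensive

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels)),
    batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```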
Related papers
- Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data? [112.0422370149713]
We tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data.
We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers.
We show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources.
arXiv Detail & Related papers (2024-07-23T16:13:22Z)
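As a toy illustration of the signal such an attack can exploit (not the paper's actual inference procedure, which works from the tokenizer's ordered merge rules), the sketch below learns the first BPE merges from two fabricated corpus mixtures and shows that the merge order shifts with the mixture; all corpora and proportions are made up.

```python
# Toy illustration (not the paper's attack): the order of BPE merges
# reflects pair frequencies in the training mix, so a tokenizer's merge
# list leaks information about the underlying data mixture.
from collections import Counter

def first_merges(corpus, n=5):
    """Return the first n BPE merges learned from a whitespace-split corpus."""
    words = [list(w) for text in corpus for w in text.split()]
    merges = []
    for _ in range(n):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere before learning the next one.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1]); i += 2
                else:
                    out.append(w[i]); i += 1
            new_words.append(out)
        words = new_words
    return merges

english = ["the quick brown fox"] * 80
banglish = ["bhai khub bhalo"] * 20
# Two different mixtures of the same sources yield different merge orders.
print(first_merges(english * 3 + banglish))  # English-heavy mix
print(first_merges(english + banglish * 3))  # Banglish-heavy mix
```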
- TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data [50.40191599304911]
We propose TransMI, a framework that can create a strong baseline well-suited for data that is transliterated into a common script.
Results show a consistent improvement of 3% to 34%, varying across different models and tasks.
arXiv Detail & Related papers (2024-05-16T09:08:09Z)
- EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi Emotion Detection [24.344204661349327]
Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech.
EmoMix-3L is a novel multi-label emotion detection dataset containing code-mixed data from three different languages.
arXiv Detail & Related papers (2024-05-11T05:58:55Z)
- SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis [26.11758147703999]
SentMix-3L is a novel dataset for sentiment analysis containing code-mixed data across three languages.
We show that GPT-3.5 outperforms all transformer-based models on SentMix-3L.
arXiv Detail & Related papers (2023-10-27T09:59:24Z)
- Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z)
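As a hedged sketch of what the code-mixed tier of such two-tiered pre-training can look like, the snippet below continues masked-language-model pretraining of a multilingual DistilBERT checkpoint on code-mixed sentences; the checkpoint name and toy corpus are stand-ins, not the paper's data or recipe.

```python
# Sketch of continued MLM pretraining on code-mixed text (the second tier
# of a two-tiered approach). Checkpoint and corpus are illustrative stand-ins.
import torch
from torch.utils.data import DataLoader
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)

corpus = ["ami aaj office jabo, but thoda late hobe",
          "kal holiday hain, tai ghumate jabo"]
enc = [tokenizer(t, truncation=True, max_length=128) for t in corpus]

# The collator randomly masks 15% of tokens and builds MLM labels for us.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
loader = DataLoader(enc, batch_size=2, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```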
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
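A minimal sketch of this kind of prompting with the OpenAI chat API is shown below; the prompt wording, the Tagalog-English pairing, and the gpt-3.5-turbo model choice are illustrative assumptions, not the templates evaluated in the paper.

```python
# Sketch of prompting an instruction-tuned LLM for code-mixed text.
# The prompt wording and model choice are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write one informal social media sentence that code-mixes "
    "Tagalog and English (Taglish) about commuting to work."
)
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.9,  # higher temperature encourages more varied mixing
)
print(resp.choices[0].message.content)
```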
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61 on the CoLI-Kenglish dataset.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models [1.14219428942199]
We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed dataset in Roman script.
We show the effectiveness of HingBERT models pre-trained on this corpus on downstream tasks such as code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark.
arXiv Detail & Related papers (2022-04-18T16:49:59Z)
- Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling [0.16252563723817934]
We classify code-mixed social media comments/posts in the Dravidian languages Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
arXiv Detail & Related papers (2021-08-27T08:43:08Z)
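The pseudo-labeling recipe follows a generic pattern: train on labeled data, predict on unlabeled data, keep only confident predictions as new labels, and retrain. Below is a schematic sketch with a simple sklearn text classifier standing in for the fine-tuned language models used in the paper; the threshold and round count are illustrative.

```python
# Schematic pseudo-labeling loop; the confidence threshold, round count,
# and the simple sklearn classifier (a stand-in for a fine-tuned
# transformer) are illustrative choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pseudo_label(model, X_lab, y_lab, X_unlab, rounds=3, threshold=0.9):
    """Iteratively grow the training set with confident predictions."""
    X, y = list(X_lab), list(y_lab)
    for _ in range(rounds):
        model.fit(X, y)                        # train on the current pool
        if not X_unlab:
            break
        proba = model.predict_proba(X_unlab)   # scores for unlabeled texts
        keep = proba.max(axis=1) >= threshold  # accept confident predictions
        X += [x for x, k in zip(X_unlab, keep) if k]
        y += list(proba.argmax(axis=1)[keep])
        X_unlab = [x for x, k in zip(X_unlab, keep) if not k]
    return model

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
# Usage: pseudo_label(clf, X_lab, y_lab, X_unlab)
```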
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
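The KL-divergence self-teaching idea has a compact general form: pull the model's predictive distribution toward auto-generated soft pseudo-labels. A minimal PyTorch sketch of that general shape is below; the paper's exact formulation may differ.

```python
# Sketch of a KL-divergence self-teaching loss: the student's predictions
# on translated target-language text are pulled toward soft pseudo-labels.
# Shapes and random targets are illustrative; see the paper for details.
import torch
import torch.nn.functional as F

def kl_self_teaching_loss(student_logits, soft_pseudo_labels):
    """KL(soft_pseudo_labels || student) averaged over the batch."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_p_student, soft_pseudo_labels, reduction="batchmean")

# Toy usage: 4 examples, 3 classes.
logits = torch.randn(4, 3, requires_grad=True)
soft_labels = F.softmax(torch.randn(4, 3), dim=-1)  # auto-generated targets
loss = kl_self_teaching_loss(logits, soft_labels)
loss.backward()
```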
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.