My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models
and Evaluation Benchmarks
- URL: http://arxiv.org/abs/2306.14030v2
- Date: Thu, 20 Jul 2023 13:54:05 GMT
- Title: My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models
and Evaluation Benchmarks
- Authors: Tanmay Chavan, Omkar Gokhale, Aditya Kane, Shantanu Patankar, Raviraj
Joshi
- Abstract summary: We focus on the low-resource Indian language Marathi which lacks any prior work in code-mixing.
We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining.
We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus.
- Score: 0.7874708385247353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The research on code-mixed data is limited due to the unavailability of
dedicated code-mixed datasets and pre-trained language models. In this work, we
focus on the low-resource Indian language Marathi, which lacks any prior work in
code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English
(Mr-En) corpus with 10 million social media sentences for pretraining. We also
release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models
pre-trained on MeCorpus. Furthermore, for benchmarking, we present three
supervised datasets MeHate, MeSent, and MeLID for downstream tasks like
code-mixed Mr-En hate speech detection, sentiment analysis, and language
identification, respectively. Each of these evaluation datasets consists of
~12,000 manually annotated Marathi-English code-mixed tweets. Ablations
show that the models trained on this novel corpus significantly outperform the
existing state-of-the-art BERT models. This is the first work that presents
artifacts for code-mixed Marathi research. All datasets and models are publicly
released at https://github.com/l3cube-pune/MarathiNLP .
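As a concrete illustration of how the released artifacts might be used, the sketch below loads a MeBERT-style checkpoint with the Hugging Face transformers library and runs it over a code-mixed Mr-En sentence for sentiment classification. The model identifier, the label count, and the example sentence are assumptions for illustration only; see the repository linked above for the actual model names and fine-tuning details.

```python
# Minimal usage sketch (not the authors' reference code): loading a
# code-mixed Marathi-English BERT checkpoint for MeSent-style sentiment
# classification with the Hugging Face transformers library.
# ASSUMPTIONS: the model id "l3cube-pune/me-bert", the 3-label sentiment
# scheme, and the example sentence are illustrative guesses; check
# https://github.com/l3cube-pune/MarathiNLP for the published names.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "l3cube-pune/me-bert"  # assumed identifier, verify against the repo

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# The classification head added here is randomly initialized; fine-tune it on
# MeSent (or another labeled set) before treating its outputs as predictions.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)
model.eval()

# A made-up code-mixed Mr-En sentence written in Roman script.
text = "movie khup chhan hoti, totally worth watching"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # per-class scores (meaningful after fine-tuning)
```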
Related papers
- Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data? [112.0422370149713]
We tackle a task we call data mixture inference, which aims to uncover the distributional make-up of training data.
We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers.
We show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources.
arXiv Detail & Related papers (2024-07-23T16:13:22Z)
- TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data [50.40191599304911]
We propose TransMI, a framework that can create a strong baseline well-suited for data that is transliterated into a common script.
Results show a consistent improvement of 3% to 34%, varying across different models and tasks.
arXiv Detail & Related papers (2024-05-16T09:08:09Z)
- From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences [18.53327811304381]
Modelling human judgements of the acceptability of code-mixed text can help in distinguishing natural code-mixed text from synthetically generated text.
Cline, the dataset introduced in this work, is the largest of its kind with 16,642 sentences drawn from two sources.
Experiments on Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned Multilingual Large Language Models (MLLMs).
arXiv Detail & Related papers (2024-05-09T06:40:39Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models [1.14219428942199]
We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed dataset in Roman script.
We show the effectiveness of these BERT models on the subsequent downstream tasks like code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark.
arXiv Detail & Related papers (2022-04-18T16:49:59Z)
- IIITT@Dravidian-CodeMix-FIRE2021: Transliterate or translate? Sentiment analysis of code-mixed text in Dravidian languages [0.0]
This paper contributes sentiment analysis of code-mixed social media comments in the Dravidian languages Kannada, Tamil, and Malayalam.
It describes work for the Dravidian-CodeMix shared task at FIRE 2021, employing pre-trained models such as ULMFiT and multilingual BERT fine-tuned on the code-mixed dataset.
The best models ranked 4th, 5th, and 10th in the Tamil, Kannada, and Malayalam tasks, respectively.
arXiv Detail & Related papers (2021-11-15T16:57:59Z)
- Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing [19.19256927651015]
We describe models that convert monolingual English text into Hinglish (code-mixed Hindi and English).
Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models.
Our models place first in the overall ranking of the English-Hinglish official shared task.
arXiv Detail & Related papers (2021-05-18T19:50:25Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)