Marathi-English Code-mixed Text Generation
- URL: http://arxiv.org/abs/2309.16202v1
- Date: Thu, 28 Sep 2023 06:51:26 GMT
- Title: Marathi-English Code-mixed Text Generation
- Authors: Dhiraj Amin, Sharvari Govilkar, Sagar Kulkarni, Yash Shashikant Lalit,
Arshi Ajaz Khwaja, Daries Xavier, Sahil Girijashankar Gupta
- Abstract summary: Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings.
This research introduces a Marathi-English code-mixed text generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code-mixing, the blending of linguistic elements from distinct languages to
form meaningful sentences, is common in multilingual settings, yielding hybrid
languages like Hinglish and Minglish. Marathi, India's third most spoken
language, often integrates English for precision and formality. Developing
code-mixed language systems, like Marathi-English (Minglish), faces resource
constraints. This research introduces a Marathi-English code-mixed text
generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code
Mixing (DCM) metrics. Across 2987 code-mixed questions, it achieved an average
CMI of 0.2 and an average DCM of 7.4, indicating effective and comprehensible
code-mixed sentences. These results offer potential for enhanced NLP tools,
bridging linguistic gaps in multilingual societies.
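The abstract evaluates generation with the Code Mixing Index (CMI). As a minimal sketch, the widely used formulation of Gambäck and Das computes CMI from per-token language tags; the paper may use a scaled or otherwise modified variant (the reported average of 0.2 suggests the unscaled 0–1 fraction rather than the percentage form). The tag names ("mr", "en", "other") are illustrative assumptions, not the paper's actual tag set.

```python
# Sketch of the Code Mixing Index (CMI) following the common formulation
# CMI = 1 - max_lang_tokens / (n - u), where n is the total token count
# and u the number of language-independent tokens. An illustration only,
# not the paper's exact implementation.
from collections import Counter

def code_mixing_index(token_langs):
    """Return CMI in [0, 1] for a list of per-token language tags.

    Tokens tagged "other" count as language-independent. Monolingual
    or empty utterances score 0.0.
    """
    n = len(token_langs)
    counts = Counter(token_langs)
    u = counts.pop("other", 0)
    if n == u:  # no language-dependent tokens at all
        return 0.0
    max_w = max(counts.values())  # token count of the dominant language
    return 1.0 - max_w / (n - u)

# Hypothetical Minglish utterance: 3 Marathi tokens, 2 English, 1 neutral
tags = ["mr", "mr", "en", "mr", "en", "other"]
print(code_mixing_index(tags))  # 1 - 3/5 = 0.4
```

A fully monolingual utterance scores 0, and the score grows toward 1 as the token distribution across languages becomes more even.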
Related papers
- Exploring Multi-Lingual Bias of Large Code Models in Code Generation [55.336629780101475]
Code generation aims to synthesize code and fulfill functional requirements based on natural language (NL) specifications.
Despite their effectiveness, we observe a noticeable multilingual bias in the generation performance of large code models (LCMs): LCMs demonstrate proficiency in generating solutions when given instructions in English, yet may falter when faced with semantically equivalent instructions in other NLs such as Chinese.
arXiv Detail & Related papers (2024-04-30T08:51:49Z)
- IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages [62.60787450345489]
We explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay.
Our analysis shows that pre-training corpus bias affects the model's ability to handle Indonesian-English code-mixing.
arXiv Detail & Related papers (2023-11-21T07:50:53Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers [1.181206257787103]
This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system.
For the HinglishEval task, the proposed model uses multi-lingual BERT to find the similarity between synthetically generated and human-generated sentences.
arXiv Detail & Related papers (2022-06-17T10:36:50Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation [1.2559148369195197]
Code-mixing is a phenomenon of mixing words and phrases from two or more languages in a single utterance of speech and text.
Various widely popular metrics perform poorly on code-mixed NLG tasks.
We present a metric independent evaluation pipeline MIPE that significantly improves the correlation between evaluation metrics and human judgments.
arXiv Detail & Related papers (2021-07-24T05:24:26Z)
- HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text [1.6675267471157407]
We present a corpus (HinGE) for the widely popular code-mixed language Hinglish (a code-mix of Hindi and English).
HinGE has Hinglish sentences generated by humans as well as two rule-based algorithms corresponding to the parallel Hindi-English sentences.
In addition, we demonstrate the inefficacy of widely-used evaluation metrics on the code-mixed data.
arXiv Detail & Related papers (2021-07-08T11:11:37Z)
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with a WER of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- Word Level Language Identification in English Telugu Code Mixed Data [7.538482310185133]
Intrasentential Code Switching (ICS) or Code Mixing (CM) is frequently observed nowadays.
We present a study of various models - Naive Bayes, Random Forest, Conditional Random Field (CRF), and Hidden Markov Model (HMM) - for language identification.
Our best performing system is CRF-based, with an F1-score of 0.91.
arXiv Detail & Related papers (2020-10-09T10:15:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.