Code-Mixer Ya Nahi: Novel Approaches to Measuring Multilingual LLMs' Code-Mixing Capabilities
- URL: http://arxiv.org/abs/2410.11079v1
- Date: Mon, 14 Oct 2024 20:40:36 GMT
- Title: Code-Mixer Ya Nahi: Novel Approaches to Measuring Multilingual LLMs' Code-Mixing Capabilities
- Authors: Ayushman Gupta, Akhil Bhogal, Kripabandhu Ghosh
- Abstract summary: Rule-Based Prompting is a novel prompting technique to generate code-mixed sentences. We measure and compare the code-mixed MT abilities of three popular multilingual LLMs. We also use $k$-shot prompting to gauge the code-mixed-to-English translation abilities of multilingual LLMs.
- Abstract: Multilingual Large Language Models (LLMs) have demonstrated exceptional performance in Machine Translation (MT) tasks. However, their MT abilities in the context of code-switching (the practice of mixing two or more languages in an utterance) remain under-explored. In this paper, we introduce Rule-Based Prompting, a novel prompting technique to generate code-mixed sentences. We measure and compare the code-mixed MT abilities of three popular multilingual LLMs: GPT-3.5-turbo, GPT-4, and Gemini Pro across five language pairs: English-{Hindi, Bengali, Gujarati, French, Spanish} using $k$-shot prompting ($k\in\{0, 1, 10, 20\}$) and Rule-Based Prompting. Our findings suggest that although $k$-shot prompting often leads to the best results, Rule-Based Prompting shows promise in generating unique code-mixed sentences that vary in their style of code-mixing. We also use $k$-shot prompting to gauge the code-mixed-to-English translation abilities of multilingual LLMs. For this purpose, we create a gold-standard code-mixed dataset spanning the same five language pairs. As a real-world application of our work, we create a code-mixed chatbot.
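To make the evaluation setup concrete, here is a minimal sketch of $k$-shot prompting for English-to-Hinglish code-mixed generation; the prompt wording, exemplar pairs, and helper names are illustrative assumptions, not the authors' exact setup:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK and an API key

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative exemplars (k=2 here; the paper sweeps k over {0, 1, 10, 20}).
EXEMPLARS = [
    ("I am going to the market.", "Main market ja raha hoon."),
    ("This movie was really good.", "Yeh movie really achhi thi."),
]

def k_shot_codemix(sentence: str, model: str = "gpt-4") -> str:
    """Ask the model for a Hindi-English code-mixed rendering of `sentence`."""
    shots = "\n".join(f"English: {en}\nCode-mixed: {cm}" for en, cm in EXEMPLARS)
    prompt = (
        "Rewrite the English sentence as a Hindi-English code-mixed sentence, "
        "following the examples.\n\n"
        f"{shots}\n\nEnglish: {sentence}\nCode-mixed:"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

print(k_shot_codemix("The weather is very pleasant today."))
```

Rule-Based Prompting, by contrast, states explicit code-mixing rules in the prompt instead of exemplars; the paper's specific rules are not reproduced here.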
Related papers
- Code-mixed LLM: Improve Large Language Models' Capability to Handle Code-Mixing through Reinforcement Learning from AI Feedback (arXiv, 2024-11-13)
Code-mixing introduces unique challenges in daily life, such as syntactic mismatches and semantic blending.
Large language models (LLMs) have revolutionized the field of natural language processing (NLP) by offering unprecedented capabilities in understanding human languages.
We propose to improve the multilingual LLMs' ability to understand code-mixing through reinforcement learning from human feedback (RLHF) and code-mixed machine translation tasks.
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment (arXiv, 2024-06-28)
Cross-lingual transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
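For context, the script barrier can be illustrated with a toy transliteration step; this sketch uses the indic-transliteration Python package and is not the paper's PPA method:

```python
# Toy illustration of script alignment: render Devanagari text in a Latin
# scheme so that models trained mostly on Latin-script data can reuse their
# subword vocabulary. (Not the paper's post-pretraining alignment method.)
from indic_transliteration import sanscript

hindi = "मैं बाज़ार जा रहा हूँ"
latin = sanscript.transliterate(hindi, sanscript.DEVANAGARI, sanscript.IAST)
print(latin)  # a Latin-script rendering; exact output is scheme-dependent
```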
- Parrot: Multilingual Visual Instruction Tuning (arXiv, 2024-06-04)
Existing methods mainly focus on aligning vision encoders with Multimodal Large Language Models (MLLMs).
We introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level.
Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks.
- How do Large Language Models Handle Multilingualism? (arXiv, 2024-02-29)
This study explores how large language models (LLMs) handle multilingualism.
In the initial layers, LLMs understand the query, converting multilingual inputs into English for task-solving.
In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures.
- Marathi-English Code-mixed Text Generation (arXiv, 2023-09-28)
Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings.
This research introduces a Marathi-English code-mixed text generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics.
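For reference, the Code Mixing Index (CMI) of Das and Gambäck (2014) can be computed from token-level language tags; a minimal sketch, where the tag scheme and function name are illustrative:

```python
def code_mixing_index(tags):
    """CMI (Das & Gambäck, 2014) from token-level language tags.

    `tags` is a list like ["mr", "en", "univ"], where "univ" marks
    language-independent tokens (punctuation, numbers, named entities).
    """
    n = len(tags)
    lang_tags = [t for t in tags if t != "univ"]
    u = n - len(lang_tags)  # count of language-independent tokens
    if n == u:              # no language-specific tokens: CMI is 0
        return 0.0
    max_lang = max(lang_tags.count(t) for t in set(lang_tags))
    return 100.0 * (1 - max_lang / (n - u))

# A sentence with 3 Marathi and 2 English tokens: CMI = 100 * (1 - 3/5) = 40
print(code_mixing_index(["mr", "en", "mr", "en", "mr"]))  # 40.0
```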
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages (arXiv, 2023-03-23)
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
- Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing (arXiv, 2021-05-18)
We describe models that convert monolingual English text into Hinglish (code-mixed Hindi and English).
Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models.
Our models place first in the overall ranking of the English-Hinglish official shared task.
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings (arXiv, 2021-03-11)
We exploit a language-independent multilingual sentence representation to generalize easily to a new language. Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English. We explore a more practical adaptation approach through non-iterative back-translation, exploiting our model's ability to produce high-quality translations.
- NLP-CUET@DravidianLangTech-EACL2021: Offensive Language Detection from Multilingual Code-Mixed Text using Transformers (arXiv, 2021-02-28)
This paper presents an automated system that can identify offensive text in multilingual code-mixed data. The datasets cover three languages: Tamil, Malayalam, and Kannada, each code-mixed with English.
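As a rough sketch of the kind of transformer classifier such systems build on (the checkpoint, label set, and example sentence are illustrative assumptions, and the untrained head produces meaningless scores until fine-tuned):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative multilingual backbone; the paper's actual checkpoints differ.
name = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# A Tamil-English code-mixed sentence; scores become meaningful only after
# fine-tuning the classification head on the shared-task data.
inputs = tok("intha movie romba nalla irukku", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)
print(probs)  # [[p(not-offensive), p(offensive)]] once fine-tuned
```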
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models (arXiv, 2020-10-13)
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
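Such cloze-style probes can be approximated with a multilingual masked LM; a minimal sketch (the model choice and prompt are illustrative, not X-FACTR's actual setup):

```python
from transformers import pipeline

# Fill-mask pipeline with a multilingual masked LM (illustrative choice).
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# A factual cloze probe: a well-trained LM should rank "France" highly.
for pred in fill("Paris is the capital of [MASK].")[:3]:
    print(f"{pred['token_str']}\t{pred['score']:.3f}")
```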