Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages
- URL: http://arxiv.org/abs/2303.13592v4
- Date: Tue, 12 Sep 2023 16:35:30 GMT
- Title: Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages
- Authors: Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Arjun
Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata, Lintang
Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia,
Thamar Solorio, Alham Fikri Aji
- Abstract summary: We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While code-mixing is a common linguistic practice in many parts of the world,
collecting high-quality and low-cost code-mixed data remains a challenge for
natural language processing (NLP) research. The recent proliferation of Large
Language Models (LLMs) compels one to ask: how capable are these systems in
generating code-mixed data? In this paper, we explore prompting multilingual
LLMs in a zero-shot manner to generate code-mixed data for seven languages in
South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese,
Tamil, and Singlish. We find that publicly available multilingual
instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of
producing texts with phrases or clauses from different languages. ChatGPT
exhibits inconsistent capabilities in generating code-mixed texts, wherein its
performance varies depending on the prompt template and language pairing. For
instance, ChatGPT generates fluent and natural Singlish texts (an English-based
creole spoken in Singapore), but for the English-Tamil language pair, the system
mostly produces grammatically incorrect or semantically meaningless utterances.
Furthermore, it may erroneously introduce languages not specified in the
prompt. Based on our investigation, existing multilingual LLMs exhibit a wide
range of proficiency in code-mixed data generation for SEA languages. As such,
we advise against using LLMs in this context without extensive human checks.
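To make the setup concrete, here is a minimal sketch of the zero-shot prompting the abstract describes, using the Hugging Face transformers library with a small BLOOMZ checkpoint; the prompt wording, language pair, and decoding settings are illustrative assumptions rather than the paper's exact protocol.

```python
# Minimal sketch: zero-shot prompting a multilingual instruction-tuned LLM
# to produce a code-mixed sentence. The prompt template below is an
# illustrative assumption, not the exact template used in the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloomz-560m")

prompt = (
    "Write a casual sentence that mixes Tagalog and English, "
    "switching languages within the sentence."
)

output = generator(prompt, max_new_tokens=60, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```

Per the abstract's conclusion, any output from such a prompt should be screened by human annotators before being used as code-mixed data.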
Related papers
- Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels.
This paper introduces a prompt-based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages.
In this work, we leveraged GPT-3.5 Turbo to test whether large language models can classify words into the correct categories.
arXiv Detail & Related papers (2024-11-06T16:20:37Z)
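Below is a hedged sketch of the prompt-based word-level language identification described in the entry above, using the OpenAI Python client; the example sentence, label set, and prompt wording are illustrative assumptions, not the shared task's exact scheme.

```python
# Sketch: prompting GPT-3.5 Turbo to label each word of a code-mixed
# sentence with its language. The example sentence, prompt wording, and
# label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

sentence = "naan office poiten after lunch"  # illustrative Tamil-English mix
prompt = (
    "Label each word of the following sentence as Tamil, English, or "
    f"Mixed. Return one 'word -> label' pair per line.\n\n{sentence}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```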
- Code-Mixer Ya Nahi: Novel Approaches to Measuring Multilingual LLMs' Code-Mixing Capabilities
Rule-Based Prompting is a novel prompting technique to generate code-mixed sentences.
We measure and compare the code-mixed MT abilities of 3 popular multilingual LLMs.
We also use $k$-shot prompting to gauge the code-mixed to English translation abilities of multilingual LLMs.
arXiv Detail & Related papers (2024-10-14T20:40:36Z)
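The k-shot prompting probe mentioned in the entry above amounts to assembling demonstration pairs ahead of the query; this sketch shows one plausible prompt builder, with demonstration pairs that are illustrative assumptions.

```python
# Sketch: building a k-shot prompt for code-mixed -> English translation.
# The demonstration pairs below are illustrative assumptions.
def build_k_shot_prompt(examples, query):
    """Format k (code-mixed, English) pairs followed by the query."""
    lines = ["Translate the code-mixed sentence into English.", ""]
    for mixed, english in examples:
        lines.append(f"Code-mixed: {mixed}")
        lines.append(f"English: {english}")
        lines.append("")
    lines.append(f"Code-mixed: {query}")
    lines.append("English:")
    return "\n".join(lines)

examples = [
    ("Kita meeting jam berapa today?", "What time is our meeting today?"),
    ("Sobrang busy ako this week.", "I am very busy this week."),
]
print(build_k_shot_prompt(examples, "Dia tak reply my message lagi."))
```

The resulting string can then be sent to any of the multilingual LLMs under test.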
- Understanding and Mitigating Language Confusion in LLMs
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- Code-mixed Sentiment and Hate-speech Prediction
Large language models have dominated most natural language processing tasks.
We created four new bilingual pre-trained masked language models for the English-Hindi and English-Slovene language pairs.
We performed an evaluation of monolingual, bilingual, few-lingual, and massively multilingual models on several languages.
arXiv Detail & Related papers (2024-05-21T16:56:36Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
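MYTE itself relies on morphology-driven segmentation tables, but the disparity it targets is visible with plain UTF-8; this sketch illustrates only the problem (unequal byte lengths across scripts), not the MYTE encoding, and the sample phrases are illustrative assumptions.

```python
# Sketch: the byte-length disparity across scripts that motivates MYTE.
# Plain UTF-8 gives much longer encodings for some scripts, which
# inflates sequence length and cost for those languages.
samples = {
    "English": "good morning",
    "Vietnamese": "chào buổi sáng",
    "Tamil": "காலை வணக்கம்",
    "Chinese": "早上好",
}
for language, text in samples.items():
    encoded = text.encode("utf-8")
    print(f"{language}: {len(text)} chars -> {len(encoded)} UTF-8 bytes")
```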
- Romanization-based Large-scale Adaptation of Multilingual Language Models
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
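As a rough illustration of the romanization step in the entry above: the paper uses the UROMAN transliterator, for which the unidecode package serves here as a crude stand-in; the sample sentences are illustrative assumptions.

```python
# Sketch: romanizing non-Latin text before adapting an mPLM.
# The paper uses the UROMAN tool; unidecode is a rough stand-in here,
# used only to illustrate the shape of the preprocessing step.
from unidecode import unidecode

corpus = ["வணக்கம் நண்பர்களே", "今天天气很好"]  # illustrative sentences
romanized = [unidecode(sentence) for sentence in corpus]
for original, roman in zip(corpus, romanized):
    print(f"{original} -> {roman}")
# The romanized corpus can then feed standard Latin-script continued
# pretraining or parameter-efficient adaptation of the mPLM.
```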
- Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien
In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants.
We propose a method to construct a Hokkien-Mandarin code-mixing (CM) dataset that mitigates the data limitation, overcomes the morphological issue under the Sino-Tibetan language family, and offers an efficient Hokkien word segmentation method.
arXiv Detail & Related papers (2023-01-21T11:04:20Z)
- CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts
Many Indians, especially youths, are comfortable with Hindi and English in addition to their local languages. Hence, they often use more than one language to post their comments on social media.
Code-mixed Kn-En texts are extracted from YouTube video comments to construct the CoLI-Kenglish dataset and code-mixed Kn-En embeddings.
The words in the CoLI-Kenglish dataset are grouped into six major categories, namely "Kannada", "English", "Mixed-language", "Name", "Location", and "Other".
arXiv Detail & Related papers (2022-11-17T19:16:56Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
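A minimal PyTorch sketch of the KL-divergence self-teaching loss described in the entry above, assuming the soft pseudo-labels are the model's own (detached) predictions on the translated text; tensor shapes and inputs are illustrative.

```python
# Sketch: KL-divergence self-teaching loss on soft pseudo-labels,
# in the style of FILTER. Shapes and inputs are illustrative assumptions.
import torch
import torch.nn.functional as F

def self_teaching_loss(target_logits, translated_logits):
    """KL divergence pulling target-language predictions toward
    soft pseudo-labels from the translated text."""
    # Pseudo-labels come from the model's own predictions on the
    # translated text; gradients do not flow through them.
    soft_labels = F.softmax(translated_logits.detach(), dim=-1)
    log_probs = F.log_softmax(target_logits, dim=-1)
    return F.kl_div(log_probs, soft_labels, reduction="batchmean")

# Example: a batch of 4 examples with 3 classes.
target_logits = torch.randn(4, 3, requires_grad=True)
translated_logits = torch.randn(4, 3)
loss = self_teaching_loss(target_logits, translated_logits)
loss.backward()
print(loss.item())
```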