Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages
- URL: http://arxiv.org/abs/2303.13592v4
- Date: Tue, 12 Sep 2023 16:35:30 GMT
- Title: Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages
- Authors: Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Arjun
Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata, Lintang
Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia,
Thamar Solorio, Alham Fikri Aji
- Abstract summary: We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA)
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
- Score: 47.78634360870564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While code-mixing is a common linguistic practice in many parts of the world,
collecting high-quality and low-cost code-mixed data remains a challenge for
natural language processing (NLP) research. The recent proliferation of Large
Language Models (LLMs) compels one to ask: how capable are these systems in
generating code-mixed data? In this paper, we explore prompting multilingual
LLMs in a zero-shot manner to generate code-mixed data for seven languages in
South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese,
Tamil, and Singlish. We find that publicly available multilingual
instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of
producing texts with phrases or clauses from different languages. ChatGPT
exhibits inconsistent capabilities in generating code-mixed texts, wherein its
performance varies depending on the prompt template and language pairing. For
instance, ChatGPT generates fluent and natural Singlish texts (an English-based
creole spoken in Singapore), but for English-Tamil language pair, the system
mostly produces grammatically incorrect or semantically meaningless utterances.
Furthermore, it may erroneously introduce languages not specified in the
prompt. Based on our investigation, existing multilingual LLMs exhibit a wide
range of proficiency in code-mixed data generation for SEA languages. As such,
we advise against using LLMs in this context without extensive human checks.
Related papers
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z) - Code-mixed Sentiment and Hate-speech Prediction [2.9140539998069803]
Large language models have dominated most natural language processing tasks.
We created four new bilingual pre-trained masked language models for English-Hindi and English-Slovene languages.
We performed an evaluation of monolingual, bilingual, few-lingual, and massively multilingual models on several languages.
arXiv Detail & Related papers (2024-05-21T16:56:36Z) - Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot [1.3741556944830366]
This study prompted GPT 3.5 to generate Afrikaans--English and Yoruba--English code-switched sentences.
The quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower when compared with the high Afrikaans-English success rate.
We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT.
arXiv Detail & Related papers (2024-04-26T07:44:44Z) - MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z) - Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z) - Romanization-based Large-scale Adaptation of Multilingual Language
Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z) - Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A
Case Study in Taiwanese Hokkien [5.272372029223681]
In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants.
We propose a method to construct a Hokkien-Mandarin CM dataset to mitigate the limitation, overcome the morphological issue under the Sino-Tibetan language family, and offer an efficient Hokkien word segmentation method.
arXiv Detail & Related papers (2023-01-21T11:04:20Z) - CoLI-Machine Learning Approaches for Code-mixed Language Identification
at the Word Level in Kannada-English Texts [0.0]
Many Indians especially youths are comfortable with Hindi and English, in addition to their local languages. Hence, they often use more than one language to post their comments on social media.
Code-mixed Kn-En texts are extracted from YouTube video comments to construct CoLI-Kenglish dataset and code-mixed Kn-En embedding.
The words in CoLI-Kenglish dataset are grouped into six major categories, namely, "Kannada", "English", "Mixed-language", "Name", "Location" and "Other.
arXiv Detail & Related papers (2022-11-17T19:16:56Z) - MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.