IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian
Local Languages
- URL: http://arxiv.org/abs/2311.12405v1
- Date: Tue, 21 Nov 2023 07:50:53 GMT
- Authors: Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata,
Pascale Fung, Ayu Purwarianti
- Abstract summary: We explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay.
Our analysis shows that pre-training corpus bias lets the model handle Indonesian-English code-mixing better than code-mixing with the other local languages.
- Score: 62.60787450345489
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Significant progress has been made on Indonesian NLP. Nevertheless,
exploration of the code-mixing phenomenon in Indonesian is limited, despite
many languages being frequently mixed with Indonesian in daily conversation. In
this work, we explore code-mixing in Indonesian with four embedded languages,
i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a
framework to evaluate and improve the code-mixing robustness. Our analysis
shows that the pre-training corpus bias affects the model's ability to better
handle Indonesian-English code-mixing when compared to other local languages,
despite having higher language diversity.
Related papers
- Code-mixed Sentiment and Hate-speech Prediction [2.9140539998069803]
Large language models have dominated most natural language processing tasks.
We created four new bilingual pre-trained masked language models for the English-Hindi and English-Slovene language pairs.
We performed an evaluation of monolingual, bilingual, few-lingual, and massively multilingual models on several languages.
(2024-05-21T16:56:36Z)
- Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages [55.963648108438555]
Large language models (LLMs) show remarkable human-like capability in various domains and languages.
We introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures.
We highlight Cendol's effectiveness across a diverse array of tasks, attaining 20% improvement, and demonstrate its capability to generalize.
(2024-04-09T09:04:30Z)
- Marathi-English Code-mixed Text Generation [0.0]
Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings.
This research introduces a Marathi-English code-mixed text generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics.
(2023-09-28T06:51:26Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
(2023-09-19T14:42:33Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
(2023-03-23T18:16:30Z)
- Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien [5.272372029223681]
In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants.
We propose a method to construct a Hokkien-Mandarin CM dataset to mitigate the limitation, overcome the morphological issue under the Sino-Tibetan language family, and offer an efficient Hokkien word segmentation method.
(2023-01-21T11:04:20Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
(2022-05-31T17:03:50Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
(2022-03-16T04:21:50Z)