Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages
- URL: http://arxiv.org/abs/2404.06138v2
- Date: Mon, 8 Jul 2024 03:33:52 GMT
- Title: Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages
- Authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung
- Abstract summary: Large language models (LLMs) show remarkable human-like capability in various domains and languages.
We introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures.
We highlight Cendol's effectiveness across a diverse array of tasks, attaining a 20% improvement, and demonstrate its capability to generalize.
- Score: 55.963648108438555
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract:   Large language models (LLMs) show remarkable human-like capability in various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering them ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining 20% improvement, and demonstrate its capability to generalize to unseen tasks and indigenous languages of Indonesia. Furthermore, Cendol models showcase improved human favorability despite their limitations in capturing indigenous knowledge and cultural values in Indonesia. In addition, we discuss the shortcomings of parameter-efficient tunings, such as LoRA, for language adaptation. Alternatively, we propose the usage of vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and showcase that safety in pre-training in one language such as English is transferable to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning. 
 
      
Related papers
- FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models [1.2403152094314245]
 We introduce FORMOSANBENCH, the first benchmark for evaluating large language models (LLMs) on low-resource Austronesian languages.
We assess model performance in zero-shot, 10-shot, and fine-tuned settings using FORMOSANBENCH.
Our results reveal a substantial performance gap between high-resource and Formosan languages.
 arXiv  Detail & Related papers  (2025-06-12T07:02:28Z)
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
 Lens is a novel approach to enhance multilingual capabilities of large language models (LLMs).
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs.
It achieves superior results with much fewer computational resources compared to existing post-training approaches.
 arXiv  Detail & Related papers  (2024-10-06T08:51:30Z)
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
 We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
 arXiv  Detail & Related papers  (2024-07-29T03:26:22Z)
- Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models [60.09618700199927]
 We propose adaptation methods that integrate LoRA into existing SSL models to extend them to new languages.
We also develop preservation strategies, including data combination and re-clustering, to retain abilities on existing languages.
 arXiv  Detail & Related papers  (2024-06-20T08:13:30Z)
- Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
 We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
 arXiv  Detail & Related papers  (2024-04-01T09:24:06Z)
- IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages [62.60787450345489]
 We explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay.
Our analysis shows that the pre-training corpus bias affects the model's ability to better handle Indonesian-English code-mixing.
 arXiv  Detail & Related papers  (2023-11-21T07:50:53Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
 We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
 arXiv  Detail & Related papers  (2023-09-19T14:42:33Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
 Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
 arXiv  Detail & Related papers  (2023-05-25T15:30:31Z)
- Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
 We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models.
 arXiv  Detail & Related papers  (2021-10-26T14:59:16Z)
- Improving Indonesian Text Classification Using Multilingual Language Model [0.0]
 This paper investigates the effect of combining English and Indonesian data on building Indonesian text classification models.
The experiment showed that the addition of English data, especially if the amount of Indonesian data is small, improves performance.
 arXiv  Detail & Related papers  (2020-09-12T03:16:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     