MaLLaM -- Malaysia Large Language Model
- URL: http://arxiv.org/abs/2401.14680v2
- Date: Mon, 29 Jan 2024 07:18:59 GMT
- Title: MaLLaM -- Malaysia Large Language Model
- Authors: Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan
- Abstract summary: We trained models with 1.1 billion, 3 billion, and 5 billion parameters on a substantial 349GB dataset.
MaLLaM contributes to enhanced natural language understanding and generation tasks in the Malay language.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Addressing the gap in large language models pretrained from scratch with a
Malaysian context, we trained models with 1.1 billion, 3 billion, and 5 billion
parameters on a substantial 349GB dataset, equivalent to 90 billion tokens
based on our pretrained Byte Pair Encoding (BPE) tokenizer, for a single epoch.
MaLLaM contributes to enhanced natural language understanding and generation
tasks in the Malay language. Although trained on a smaller dataset of 90
billion tokens, our instruction-tuned MaLLaM models perform competitively. When
compared to ChatGPT-3.5 and Malaysian Mistral, MaLLaM's instruction-tuned models
demonstrate notable proficiency, underscoring the effectiveness of our approach
in capturing and understanding the nuances of the Malaysian language. MaLLaM
models mark a significant contribution to the field, providing comprehensive
language representations grounded in Malaysian context. This endeavor aims to
pave the way for enhanced natural language understanding and generation tasks
specific to the linguistic nuances present in Malaysia. We discuss the training
methodology, dataset composition, and the potential impact of MaLLaM in
advancing the capabilities of large language models within the context of the
Malay language.
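As a rough illustration of how a token count like the 90 billion figure can be derived from raw text, the sketch below trains a byte-level BPE tokenizer with the Hugging Face tokenizers library and uses it to count tokens over a corpus file. The vocabulary size, special tokens, and corpus path are illustrative assumptions, not the configuration actually used for MaLLaM.
```python
# Illustrative sketch only: train a byte-level BPE tokenizer and use it to
# estimate the token count of a corpus. Vocabulary size, special tokens, and
# the corpus path are assumptions, not the MaLLaM training configuration.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(vocab_size=32000, special_tokens=["<s>", "</s>", "<unk>"])
tokenizer.train(files=["malay_corpus.txt"], trainer=trainer)  # hypothetical corpus file

# Count tokens line by line to estimate dataset size in tokens.
total_tokens = 0
with open("malay_corpus.txt", encoding="utf-8") as f:
    for line in f:
        total_tokens += len(tokenizer.encode(line).ids)
print(f"Approximate token count: {total_tokens}")
```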
All models are released at
https://huggingface.co/collections/mesolitica/mallam-6577b59d1e0b436ae75f930f
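For quick experimentation, a checkpoint from the collection above can be loaded with Hugging Face transformers. The sketch below assumes a causal-LM checkpoint and uses a placeholder repository id, since individual checkpoint names are listed in the collection rather than here.
```python
# Minimal sketch of loading a released MaLLaM checkpoint with transformers.
# The repository id below is a placeholder; substitute a checkpoint from the
# mesolitica collection linked above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/<mallam-checkpoint>"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Apakah ibu negara Malaysia?"  # "What is the capital of Malaysia?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```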
Related papers
- MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish [17.36441080071885]
This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish.
Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models.
arXiv Detail & Related papers (2024-12-21T05:50:48Z)
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
- Bridging the Gap: Transfer Learning from English PLMs to Malaysian English [1.8241632171540025]
Malaysian English is a low-resource creole language.
Named Entity Recognition models underperform when capturing entities from Malaysian English text.
We introduce MENmBERT and MENBERT, pre-trained language models with contextual understanding.
arXiv Detail & Related papers (2024-07-01T15:26:03Z)
- Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages [3.9018931027384056]
We present "Paramanu", a family of novel language models (LMs) for Indian languages.
It covers 10 languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts.
The models are pretrained on a single GPU with context size of 1024 and vary in size from 13.29 million (M) to 367.5 M parameters.
arXiv Detail & Related papers (2024-01-31T17:58:10Z)
- Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding [0.0]
We present significant advancements in the pretraining of Mistral 7B, a large-scale language model.
We release models with context lengths of 4096 and 32768 tokens, and further refine performance with a specialized 16384 context length instruction-tuned model.
We present compelling results indicating Malaysian Mistral's superior performance on the Tatabahasa (Malay grammar) test set.
arXiv Detail & Related papers (2024-01-24T16:21:28Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- SERENGETI: Massively Multilingual Language Models for Africa [5.945320097465418]
We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages.
arXiv Detail & Related papers (2022-12-21T05:54:14Z)
- Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z)
- SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages [102.50127671423752]
We introduce SMaLL-100, a distilled version of the M2M-100 (12B) machine translation model covering 100 languages.
We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages.
Our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.
arXiv Detail & Related papers (2022-10-20T22:32:29Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)