Large Malaysian Language Model Based on Mistral for Enhanced Local
Language Understanding
- URL: http://arxiv.org/abs/2401.13565v3
- Date: Sun, 4 Feb 2024 06:52:28 GMT
- Title: Large Malaysian Language Model Based on Mistral for Enhanced Local
Language Understanding
- Authors: Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan
- Abstract summary: We present significant advancements in the pretraining of Mistral 7B, a large-scale language model.
We release models with context lengths of 4096 and 32768 tokens, and further refine performance with a specialized 16384 context length instruction-tuned model.
We present compelling results indicating Malaysian Mistral's superior performance on the Tatabahasa (Malay grammar) test set.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present significant advancements in the pretraining of
Mistral 7B, a large-scale language model, using a 32.6 GB dataset equivalent to
1.1 billion tokens. We explore the impact of extending the context length,
releasing models with context lengths of 4096 and 32768 tokens, and further
refine performance with a specialized instruction-tuned model with a
16384-token context length, which we call Malaysian Mistral.
Our experiments demonstrate the efficacy of continued pretraining and the
influence of extended context lengths on Mistral 7B's language understanding
capabilities. Additionally, we release a model specifically instruction-tuned
with a 16384-token context length, showcasing its potential for capturing
nuanced language intricacies.
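As a rough illustration of what continued pretraining on long sequences involves, the sketch below resumes causal-language-model training from the Mistral 7B base with 32768-token inputs using Hugging Face transformers. The corpus path, packing strategy, and hyperparameters are illustrative assumptions, not the authors' actual recipe.

```python
# A minimal sketch, not the authors' recipe: corpus path, packing strategy,
# and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

# Mistral uses rotary position embeddings, so longer training sequences
# (here up to 32768 tokens) need no additional position parameters.
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")

# Hypothetical local Malay corpus; the paper's 32.6 GB / 1.1B-token dataset
# is not reproduced here.
raw = load_dataset("text", data_files={"train": "malay_corpus.txt"})["train"]

def tokenize(batch):
    # Truncate to the target context length; a production run would pack
    # documents into full-length sequences instead.
    return tokenizer(batch["text"], truncation=True, max_length=32768)

train_ds = raw.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="malaysian-mistral-ckpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

In practice, long-context continued pretraining at this scale also requires sharded or memory-efficient training (e.g. FSDP or DeepSpeed), which is omitted here for brevity.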
Furthermore, our research contributes to the benchmarking of Malaysian
Mistral against prominent language models, including ChatGPT3.5 and Claude 2.
We present compelling results indicating Malaysian Mistral's superior
performance on the Tatabahasa (Malay grammar) test set, particularly when
fine-tuned with instructions.
All models are released at
https://huggingface.co/collections/mesolitica/malaysian-mistral-7b-6528f2ec825f4bba46c1700c
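A minimal loading sketch for the released checkpoints follows. The repository ID is an assumption for illustration; the exact model names are listed in the collection linked above, and the prompt is a hypothetical Malay grammar question.

```python
# Repository ID below is illustrative; see the collection link above for the
# exact released model names.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "mesolitica/malaysian-mistral-7b-32k-instructions"  # assumed ID
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, device_map="auto", torch_dtype="auto"
)

# Hypothetical Malay prompt asking for a grammar explanation.
prompt = "Terangkan perbezaan penggunaan kata sendi nama 'di' dan 'pada'."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The instruction-tuned checkpoints may expect a specific chat template; consult the model cards in the collection before formatting prompts.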
Related papers
- Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier [72.5652085347547]
We introduce the Aya Expanse model family, a new generation of 8B and 32B parameter multilingual language models.
By leveraging several years of research at Cohere For AI and Cohere, Aya Expanse sets a new state-of-the-art in multilingual performance.
Our evaluations on the Arena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya Expanse 8B and 32B outperform leading open-weight models.
arXiv Detail & Related papers (2024-12-05T15:41:06Z) - Feriji: A French-Zarma Parallel Corpus, Glossary & Translator [3.3073775218038883]
This paper introduces Feriji, the first robust French-Zarma parallel corpus and glossary designed for machine translation.
We fine-tune three large language models on our dataset, obtaining a BLEU score of 30.06 on the best-performing model.
arXiv Detail & Related papers (2024-06-09T19:08:33Z) - Aya 23: Open Weight Releases to Further Multilingual Progress [47.673416416949145]
Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection.
The result is a powerful multilingual large language model serving 23 languages, expanding state-of-the-art language modeling capabilities to approximately half of the world's population.
arXiv Detail & Related papers (2024-05-23T20:10:38Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - MaLLaM -- Malaysia Large Language Model [0.0]
We trained models with 1.1 billion, 3 billion, and 5 billion parameters on a substantial 349GB dataset.
MaLLaM contributes to enhanced natural language understanding and generation tasks in the Malay language.
arXiv Detail & Related papers (2024-01-26T06:56:05Z) - SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages.
SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning.
Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z) - Assessing Translation capabilities of Large Language Models involving
English and Indian Languages [4.067706269490143]
We explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages.
We fine-tune these large language models using parameter-efficient fine-tuning methods such as LoRA, as well as full fine-tuning.
Our results demonstrate significant progress, with average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07, as well as CHRF scores of 43.98, 46.99, 42.55, 42.42, and 45.39, respectively.
arXiv Detail & Related papers (2023-11-15T18:58:19Z) - Mistral 7B [62.17530433867458]
Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation.
We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks.
arXiv Detail & Related papers (2023-10-10T17:54:58Z) - Sabiá: Portuguese Large Language Models [14.801853435122908]
We show that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora.
Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin.
arXiv Detail & Related papers (2023-04-16T20:11:19Z) - SERENGETI: Massively Multilingual Language Models for Africa [5.945320097465418]
We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages.
arXiv Detail & Related papers (2022-12-21T05:54:14Z) - Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)