Sailor: Open Language Models for South-East Asia
- URL: http://arxiv.org/abs/2404.03608v1
- Date: Thu, 4 Apr 2024 17:31:32 GMT
- Title: Sailor: Open Language Models for South-East Asia
- Authors: Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Wei Lu, Min Lin
- Abstract summary: Sailor is a family of open language models tailored for South-East Asian (SEA) languages.
Sailor models are continually pre-trained on 200B to 400B tokens, primarily covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao.
- Score: 31.8422378772881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Sailor, a family of open language models ranging from 0.5B to 7B parameters, tailored for South-East Asian (SEA) languages. These models are continually pre-trained from Qwen1.5, a strong base model for multilingual use cases. Starting from Qwen1.5, Sailor models are further trained on 200B to 400B tokens, primarily covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout to improve model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize the data mixture. Experimental results on four typical tasks indicate that Sailor models perform strongly across benchmarks for commonsense reasoning, question answering, reading comprehension, and examinations. Embracing the open-source spirit, we share our insights through this report to spark wider interest in developing large language models for multilingual use cases.
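The abstract names BPE dropout as one of the robustness techniques used during continual pre-training. As a minimal sketch (not Sailor's released code), the snippet below shows how BPE-dropout-style subword regularization can be switched on with the Hugging Face tokenizers library; the toy corpus, vocabulary size, and 0.1 dropout rate are illustrative assumptions.

```python
# Minimal sketch of BPE dropout (subword regularization), as mentioned in the
# Sailor abstract. This is NOT the authors' training code: the toy corpus,
# vocabulary size, and dropout rate of 0.1 are illustrative assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

corpus = [
    "Selamat pagi, apa kabar hari ini?",   # Indonesian / Malay
    "Xin chào, hôm nay bạn khỏe không?",   # Vietnamese
    "Good morning, how are you today?",    # English
]

# A BPE model created with dropout > 0 randomly skips merge operations at
# encoding time, so the same sentence can receive different segmentations.
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(corpus, BpeTrainer(vocab_size=300, special_tokens=["[UNK]"]))

# Encoding the same sentence several times illustrates the stochastic
# segmentations that act as a regularizer during pre-training.
for _ in range(3):
    print(tokenizer.encode(corpus[0]).tokens)

# At inference time, dropout is typically disabled so segmentation is deterministic.
```

At corpus scale, exposing the model to varied segmentations of the same text reduces its sensitivity to any single tokenization, which is the robustness benefit the abstract refers to.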
Related papers
- Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs [61.117234373024665]
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes.
Sailor2 undergoes continuous pre-training on 500B tokens to support 13 SEA languages while retaining proficiency in Chinese and English.
The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages.
arXiv Detail & Related papers (2025-02-18T16:04:57Z)
- Xmodel-1.5: An 1B-scale Multilingual LLM [4.298869484709548]
We introduce Xmodel-1.5, a multilingual large language model pretrained on 2 trillion tokens.
Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy.
The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English.
arXiv Detail & Related papers (2024-11-15T10:01:52Z)
- PALO: A Polyglot Large Multimodal Model for 5B People [23.839200109033538]
In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multimodal Model called PALO.
PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese.
arXiv Detail & Related papers (2024-02-22T18:59:58Z)
- CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens.
We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio.
To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
- WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.