Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
- URL: http://arxiv.org/abs/2502.12982v1
- Date: Tue, 18 Feb 2025 16:04:57 GMT
- Title: Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
- Authors: Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
- Abstract summary: Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes.
Sailor2 undergoes continuous pre-training on 500B tokens to support 13 SEA languages while retaining proficiency in Chinese and English.
The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages.
- Score: 61.117234373024665
- License:
- Abstract: Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop the multilingual model in an efficient manner, including five key aspects: data curation, pre-training, post-training, model customization and evaluation. We hope that Sailor2 model (Apache 2.0 license) will drive language development in the SEA region, and Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
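As a rough illustration of the data mixture described in the abstract (400B SEA-specific tokens plus 100B replay tokens, i.e. roughly an 80/20 split), the Python sketch below shows one way such a mixture could be realized as per-document sampling during continual pre-training. The corpus names and the sampler interface are assumptions made for illustration only, not the authors' actual pipeline.

```python
import random

# A minimal sketch, assuming the 400B SEA / 100B replay token mixture from the
# abstract is realized as an ~80/20 per-document sampling ratio. Corpus names
# and the sampler interface are illustrative, not the Sailor2 pipeline.

SEA_WEIGHT = 0.8                  # ~400B SEA-specific tokens out of 500B total
REPLAY_WEIGHT = 1.0 - SEA_WEIGHT  # ~100B English/Chinese replay tokens

def sample_document(sea_corpus, replay_corpus, rng=random):
    """Pick the next training document according to the mixture weights."""
    if rng.random() < SEA_WEIGHT:
        return rng.choice(sea_corpus)   # SEA-specific data (Thai, Vietnamese, ...)
    return rng.choice(replay_corpus)    # replay data to limit forgetting of EN/ZH

# Toy usage: build a small mixed stream of documents.
sea_corpus = ["doc_th_001", "doc_vi_001", "doc_id_001"]
replay_corpus = ["doc_en_001", "doc_zh_001"]
print([sample_document(sea_corpus, replay_corpus) for _ in range(10)])
```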
Related papers
- SEA-HELM: Southeast Asian Holistic Evaluation of Language Models [2.119348427296952]
SEA-HELM is a comprehensive and authentic evaluation suite for languages in the Southeast Asian (SEA) region.
It comprises five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, (5) Safety.
SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese.
arXiv Detail & Related papers (2025-02-20T06:32:45Z)
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
- Sailor: Open Language Models for South-East Asia [31.8422378772881]
Sailor is a family of open language models tailored for South-East Asian (SEA) languages.
Sailor models are trained on 200B to 400B tokens, primarily covering the languages of English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao.
arXiv Detail & Related papers (2024-04-04T17:31:32Z)
- PALO: A Polyglot Large Multimodal Model for 5B People [23.839200109033538]
In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multimodal Model called PALO.
PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese.
arXiv Detail & Related papers (2024-02-22T18:59:58Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages.
SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning.
Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z)
- Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. (A minimal sketch of such a schedule appears after this list.)
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
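The staged curriculum summarized in the PolyLM entry above, where the non-English share of the pre-training mix rises from 30% in the first stage to 60% in the final stage, can be pictured with a minimal schedule like the one below. The single stage boundary at the halfway point is an assumption made for illustration; PolyLM's actual stage lengths are not given here.

```python
# A minimal sketch of a two-stage data curriculum as summarized above for
# PolyLM: the non-English share rises from 30% in the first stage to 60% in
# the final stage. The stage boundary (halfway through training) is an
# assumption for illustration, not PolyLM's published configuration.

def non_english_proportion(progress: float, boundary: float = 0.5) -> float:
    """Return the target non-English data share at a given training progress.

    progress: fraction of total pre-training completed, in [0.0, 1.0].
    """
    return 0.30 if progress < boundary else 0.60

# Example: query the schedule at a few points during pre-training.
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"progress={p:.2f} -> non-English share={non_english_proportion(p):.2f}")
```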
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.