Efficient and Effective Vocabulary Expansion Towards Multilingual Large
Language Models
- URL: http://arxiv.org/abs/2402.14714v1
- Date: Thu, 22 Feb 2024 17:12:39 GMT
- Title: Efficient and Effective Vocabulary Expansion Towards Multilingual Large
Language Models
- Authors: Seungduk Kim, Seungtaek Choi, Myeongho Jeong
- Abstract summary: This report introduces EEVE-Korean-v1.0, a Korean adaptation of large language models.
Our method can significantly boost non-English proficiency within just 2 billion tokens.
- Score: 9.359647125218359
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report introduces \texttt{EEVE-Korean-v1.0}, a Korean adaptation of
large language models that exhibit remarkable capabilities across English and
Korean text understanding. Building on recent highly capable but
English-centric LLMs, such as SOLAR-10.7B and Phi-2, where non-English texts
are inefficiently processed with English-centric tokenizers, we present an
efficient and effective vocabulary expansion (EEVE) method, which encompasses
parameter freezing and subword initialization. In contrast to previous efforts
that believe new embeddings require trillions of training tokens, we show that
our method can significantly boost non-English proficiency within just 2
billion tokens. Surpassing most instruction-tuned LLMs on the Open Ko-LLM
Leaderboard, as of January 2024, our model \texttt{EEVE-Korean-10.8B-v1.0}
ranks as the leading Korean pre-trained model in the open-source community,
according to Hugging Face's leaderboard. We open-source our models on
Huggingface to empower the open research community in various languages.
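The abstract names the two ingredients of EEVE, subword initialization and parameter freezing, without further detail. The sketch below is a minimal illustration of that general idea using Hugging Face transformers, not the authors' exact multi-stage recipe: the base checkpoint name and the example Korean tokens are placeholders, new tokens are added to the tokenizer, each new embedding is initialized as the mean of the embeddings of the subwords the original tokenizer would have used for that string, and everything except the embedding layers and LM head is frozen for the first phase of training.

```python
# Minimal sketch (not the paper's exact recipe) of vocabulary expansion with
# subword-based initialization and parameter freezing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "upstage/SOLAR-10.7B-v1.0"          # assumed base checkpoint (placeholder)
NEW_TOKENS = ["안녕하세요", "감사합니다"]    # hypothetical new Korean tokens

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# Record how the *original* tokenizer splits each new token, before expanding.
old_pieces = {t: tokenizer(t, add_special_tokens=False).input_ids for t in NEW_TOKENS}

tokenizer.add_tokens(NEW_TOKENS)
model.resize_token_embeddings(len(tokenizer))

emb_in = model.get_input_embeddings().weight.data     # (vocab_size, hidden)
emb_out = model.get_output_embeddings().weight.data   # LM head (Llama-style, untied)

with torch.no_grad():
    for tok, piece_ids in old_pieces.items():
        new_id = tokenizer.convert_tokens_to_ids(tok)
        if piece_ids:  # subword initialization: average the old subword embeddings
            emb_in[new_id] = emb_in[piece_ids].mean(dim=0)
            emb_out[new_id] = emb_out[piece_ids].mean(dim=0)

# Parameter freezing: train only the embedding layers and LM head at first;
# the transformer body stays frozen.
for p in model.parameters():
    p.requires_grad = False
for module in (model.get_input_embeddings(), model.get_output_embeddings()):
    for p in module.parameters():
        p.requires_grad = True
```

In the paper the frozen/trainable split changes over several training stages; this sketch only shows a plausible starting point.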
Related papers
- RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining [0.0]
We present RedWhale, a model specifically tailored for Korean language processing.
RedWhale is developed using an efficient continual pretraining approach that includes a comprehensive Korean corpus preprocessing pipeline.
Experimental results demonstrate that RedWhale outperforms other leading models on Korean NLP benchmarks.
arXiv Detail & Related papers (2024-08-21T02:49:41Z) - GECKO: Generative Language Model for English, Code and Korean [0.02046223849354785]
We introduce GECKO, a bilingual large language model (LLM) optimized for Korean and English, along with programming languages.
GECKO is pretrained on a balanced, high-quality corpus of Korean and English, employing the LLaMA architecture.
arXiv Detail & Related papers (2024-05-24T15:30:41Z) - KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models [0.0]
KIT-19 is a dataset created in an instruction format, comprising 19 existing open-source datasets for Korean NLP tasks.
The experimental results show that the model trained on KIT-19 significantly outperforms existing Korean LLMs.
arXiv Detail & Related papers (2024-03-25T06:15:21Z) - YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z) - Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z) - PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training (a toy sketch of such a mixing schedule appears after this list).
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean
Language Models [6.907247943327277]
Polyglot is a pioneering project aimed at enhancing the non-English language performance of multilingual language models.
We introduce the Polyglot Korean models, which reflect a language-specific focus rather than a broadly multilingual one.
arXiv Detail & Related papers (2023-06-04T04:04:04Z) - Efficiently Aligned Cross-Lingual Transfer Learning for Conversational
Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel, large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z) - Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation
System for the WMT22 Translation Task [49.916963624249355]
This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task.
We participate in the general translation task on English⇔Livonian.
Our system is based on M2M100 with novel techniques that adapt it to the target language pair.
arXiv Detail & Related papers (2022-10-17T04:34:09Z) - KoreALBERT: Pretraining a Lite BERT Model for Korean Language
Understanding [6.414554168135807]
KoreALBERT is a monolingual ALBERT model specifically for Korean language understanding.
Our pretrained KoreALBERT outperforms its BERT counterpart on 6 different NLU tasks.
arXiv Detail & Related papers (2021-01-27T12:48:53Z) - Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
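The 30%/60% curriculum mentioned in the PolyLM summary above can be pictured as a stage-dependent sampling ratio between English and non-English data. The sketch below is a toy illustration of such a schedule, not PolyLM's actual data pipeline; the stage boundary and the stream names are placeholders, and only the 30%/60% figures come from the summary.

```python
# Toy two-stage curriculum mixing schedule (illustrative only): the share of
# non-English examples rises from 30% in stage one to 60% in the final stage.
import itertools
import random

def curriculum_sampler(english_stream, non_english_stream, total_steps,
                       start_ratio=0.30, end_ratio=0.60, stage_boundary=0.5):
    """Yield one example per step from two (infinite) iterators.

    `stage_boundary` is the fraction of training at which the second-stage
    ratio takes over; it is a placeholder, not a value from the paper.
    """
    for step in range(total_steps):
        ratio = start_ratio if step < stage_boundary * total_steps else end_ratio
        yield next(non_english_stream if random.random() < ratio else english_stream)

# Dummy usage with cyclic placeholder data.
en = itertools.cycle(["<english doc>"])
xx = itertools.cycle(["<non-english doc>"])
examples = list(curriculum_sampler(en, xx, total_steps=8))
```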