A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean
Language Models
- URL: http://arxiv.org/abs/2306.02254v2
- Date: Tue, 6 Jun 2023 03:27:33 GMT
- Title: A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean
Language Models
- Authors: Hyunwoong Ko, Kichang Yang, Minho Ryu, Taekyoon Choi, Seungmu Yang,
Jiwung Hyun, Sungho Park, Kyubyong Park
- Abstract summary: Polyglot is a pioneering project aimed at enhancing the non-English language performance of multilingual language models.
We introduce the Polyglot Korean models, which represent a specific focus rather than being multilingual in nature.
- Score: 6.907247943327277
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Polyglot is a pioneering project aimed at enhancing the non-English language
performance of multilingual language models. Despite the availability of
various multilingual models such as mBERT (Devlin et al., 2019), XGLM (Lin et
al., 2022), and BLOOM (Scao et al., 2022), researchers and developers often
resort to building monolingual models in their respective languages due to the
dissatisfaction with the current multilingual models' non-English language
capabilities. Addressing this gap, we seek to develop advanced multilingual
language models that offer improved performance in non-English languages. In
this paper, we introduce the Polyglot Korean models, which represent a specific
focus rather than being multilingual in nature. In collaboration with TUNiB,
our team collected 1.2TB of Korean data meticulously curated for our research
journey. We made a deliberate decision to prioritize the development of Korean
models before venturing into multilingual models. This choice was motivated by
multiple factors: firstly, the Korean models facilitated performance
comparisons with existing multilingual models; and secondly, they catered to the
specific needs of Korean companies and researchers. This paper presents our
work in developing the Polyglot Korean models, which represent a step towards
addressing the non-English language performance gap in multilingual language
models.
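Since the Polyglot Korean models are released openly, a minimal usage sketch follows. It assumes the checkpoints are hosted on the Hugging Face Hub under names such as EleutherAI/polyglot-ko-1.3b (the repository id and model size here are assumptions; check the actual release) and uses the standard transformers causal-LM API.

```python
# Minimal sketch: load a released Polyglot-Ko checkpoint and generate text.
# The repository id below is an assumption; substitute the checkpoint you need.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/polyglot-ko-1.3b"  # assumed Hub id for the 1.3B model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "한국어 언어 모델은"  # "Korean language models are ..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Larger checkpoints, if released under the same scheme, should expose the same interface; only the repository id and memory footprint change.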
Related papers
- Poro 34B and the Blessing of Multilinguality [3.270981284471548]
Poro 34B is a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages.
We show that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish.
arXiv Detail & Related papers (2024-04-02T11:34:12Z)
- Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models [110.10545153845051]
Cross-lingual Expert Language Models (X-ELM) specialize individual expert models to different languages while remaining effective as a multilingual ensemble.
X-ELM offers benefits beyond performance improvements: new experts can be added iteratively, adapting X-ELM to new languages without catastrophic forgetting.
arXiv Detail & Related papers (2024-01-19T01:07:50Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
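As a rough illustration of the curriculum strategy described above, the sketch below raises the share of non-English data sampled per step from 30% to 60% over pre-training. This is a hypothetical example, not PolyLM's implementation: the linear schedule, step counts, and function names are all assumptions.

```python
# Hypothetical data-mixing curriculum: the fraction of non-English examples
# grows linearly from 30% at the start of pre-training to 60% at the end.
import random

def non_english_fraction(step: int, total_steps: int,
                         start: float = 0.30, end: float = 0.60) -> float:
    """Return the target share of non-English data at a given training step."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress

def sample_language_bucket(step: int, total_steps: int) -> str:
    """Pick which data pool the next training example is drawn from."""
    p = non_english_fraction(step, total_steps)
    return "non_english" if random.random() < p else "english"

# Example: the schedule at the start, middle, and end of a 100k-step run.
for step in (0, 50_000, 100_000):
    print(step, round(non_english_fraction(step, 100_000), 2))
```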
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Lost in Translation: Large Language Models in Non-English Content Analysis [0.0]
Large language models have become the dominant approach for building AI systems to analyze and generate language online.
Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English.
arXiv Detail & Related papers (2023-06-12T19:10:47Z)
- Lessons learned from the evaluation of Spanish Language Models [27.653133576469276]
We present a head-to-head comparison of language models for Spanish with the following results.
We argue for the need for more research to understand the factors underlying these results.
The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem.
arXiv Detail & Related papers (2022-12-16T10:33:38Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Evaluating Cross-Lingual Transfer Learning Approaches in Multilingual Conversational Agent Models [1.52292571922932]
We propose a general multilingual model framework for Natural Language Understanding (NLU) models.
We show that these multilingual models can reach the same or better performance than monolingual models on language-specific test data.
arXiv Detail & Related papers (2020-12-07T17:14:52Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
- XPersona: Evaluating Multilingual Personalized Chatbot [76.00426517401894]
We propose a multi-lingual extension of Persona-Chat, namely XPersona.
Our dataset includes persona conversations in six different languages other than English for building and evaluating multilingual personalized agents.
arXiv Detail & Related papers (2020-03-17T07:52:08Z)