Related papers: Kanana: Compute-efficient Bilingual Language Models

URL: http://arxiv.org/abs/2502.18934v3
Date: Fri, 28 Feb 2025 14:23:16 GMT
Title: Kanana: Compute-efficient Bilingual Language Models
Authors: Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo,
Abstract summary: Kanana is a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models. The report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling.
Score: 9.597618914676106
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Kanana, a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters with 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.

Related papers

Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages [16.671158083515373]
We develop a fluent preference-aligned language model without instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments.
arXiv Detail & Related papers (2025-12-09T16:31:48Z)
PLaMo 2 Technical Report [9.166942912957724]
We introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture. PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.
arXiv Detail & Related papers (2025-09-05T08:17:59Z)
A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic [9.380879437204277]
An effective approach to the development of ASR systems for low-resource languages is to fine-tune an existing multilingual end-to-end model. We show that an approach combining hybrid HMMs with self-supervised models can yield substantially better performance with limited training data. We benchmark our approach on Scottish Gaelic, achieving WER reductions of 32% relative over our best fine-tuned Whisper model.
arXiv Detail & Related papers (2025-06-05T11:52:08Z)
The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance. Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes. We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining [4.38070902806635]
We set up a benchmark for languages Croatian, Serbian, Bosnian and Montenegrin. We show that comparable performance to dedicated from-scratch models can be obtained by additionally pretraining available multilingual models. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.
arXiv Detail & Related papers (2024-04-08T11:55:44Z)
CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens. We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio. To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z)
On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models. We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment in Central Philippine Languages [8.64545246732563]
We introduce and release BasahaCorpus as part of an initiative aimed at expanding available corpora and baseline models for readability assessment in lower resource languages in the Philippines. We compiled a corpus of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and Rinconada languages. We propose a new hierarchical cross-lingual modeling approach that takes advantage of a language's placement in the family tree to increase the amount of available training data.
arXiv Detail & Related papers (2023-10-17T21:05:20Z)
A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models [87.7086269902562]
We show that subword-based models might still be the most practical choice in many settings. We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z)
From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance. The first stage targets at recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer. The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z)
Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages. We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language. We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models. Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.