Vikhr: Constructing a State-of-the-art Bilingual Open-Source Instruction-Following Large Language Model for Russia
- URL: http://arxiv.org/abs/2405.13929v3
- Date: Sat, 26 Oct 2024 08:47:36 GMT
- Title: Vikhr: Constructing a State-of-the-art Bilingual Open-Source Instruction-Following Large Language Model for Russia
- Authors: Aleksandr Nikolich, Konstantin Korolev, Sergei Bratchikov, Igor Kiselev, Artem Shelmanov,
- Abstract summary: Vikhr is a state-of-the-art bilingual open-source instruction-following LLM designed specifically for the Russian language.
"Vikhr" refers to the name of the Mistral LLM series and means a "strong gust of wind"
- Score: 44.13635168077528
- License:
- Abstract: There has been a surge in developing various Large Language Models (LLMs). However, text generation for languages other than English often faces significant challenges, including poor generation quality and reduced computational performance due to the disproportionate representation of tokens in the model's vocabulary. In this work, we address these issues by developing a pipeline for adapting English-oriented pre-trained models to other languages and constructing efficient bilingual LLMs. Using this pipeline, we construct Vikhr, a state-of-the-art bilingual open-source instruction-following LLM designed specifically for the Russian language. "Vikhr" refers to the name of the Mistral LLM series and means a "strong gust of wind." Unlike previous Russian-language models that typically rely on LoRA adapters on top of English-oriented models, sacrificing performance for lower training costs, Vikhr features an adapted tokenizer vocabulary and undergoes continued pre-training and instruction tuning of all weights. This not only enhances the model's performance but also significantly improves its computational and contextual efficiency. The remarkable performance of Vikhr across various Russian-language benchmarks can also be attributed to our efforts in expanding instruction datasets and corpora for continued pre-training. Vikhr not only sets a new state of the art among open-source LLMs for Russian but even outperforms some proprietary closed-source models on certain benchmarks. The model weights, instruction sets, and code are publicly available.
Related papers
- Exploring Pretraining via Active Forgetting for Improving Cross Lingual Transfer for Decoder Language Models [7.998168689120558]
Large Language Models (LLMs) demonstrate exceptional capabilities in a multitude of NLP tasks.
The efficacy of such models to languages other than English is often limited.
We show that LLMs pretrained with active forgetting are highly effective when adapting to new and unseen languages.
arXiv Detail & Related papers (2024-10-21T16:33:16Z) - MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing [78.62611800987817]
Large Language Models (LLMs) are often English-centric due to the disproportionate distribution of languages in their pre-training data.
We propose a method called MoE-LPR (Mixture-of-Experts with Language Priors) to enhance the multilingual capability.
arXiv Detail & Related papers (2024-08-21T07:43:49Z) - Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities [2.047424180164312]
Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges.
We introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English.
arXiv Detail & Related papers (2024-07-09T17:51:37Z) - Machine Translation with Large Language Models: Prompt Engineering for
Persian, English, and Russian Directions [0.0]
Generative large language models (LLMs) have demonstrated exceptional proficiency in various natural language processing (NLP) tasks.
We conducted an investigation into two popular prompting methods and their combination, focusing on cross-language combinations of Persian, English, and Russian.
arXiv Detail & Related papers (2024-01-16T15:16:34Z) - Extrapolating Multilingual Understanding Models as Multilingual
Generators [82.1355802012414]
This paper explores methods to empower multilingual understanding models the generation abilities to get a unified model.
We propose a textbfSemantic-textbfGuided textbfAlignment-then-Denoising (SGA) approach to adapt an encoder to a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z) - Generalizing Multimodal Pre-training into Multilingual via Language
Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a textbfMultitextbfLingual textbfAcquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual.
arXiv Detail & Related papers (2022-05-29T08:53:22Z) - Cross-lingual Transferring of Pre-trained Contextualized Language Models [73.97131976850424]
We propose a novel cross-lingual model transferring framework for PrLMs: TreLM.
To handle the symbol order and sequence length differences between languages, we propose an intermediate TRILayer" structure.
We show the proposed framework significantly outperforms language models trained from scratch with limited data in both performance and efficiency.
arXiv Detail & Related papers (2021-07-27T06:51:13Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Testing pre-trained Transformer models for Lithuanian news clustering [0.0]
Non-English languages could not leverage such new opportunities with the English text pre-trained models.
We compare pre-trained multilingual BERT, XLM-R, and older learned text representation methods as encodings for the task of Lithuanian news clustering.
Our results indicate that publicly available pre-trained multilingual Transformer models can be fine-tuned to surpass word vectors but still score much lower than specially trained doc2vec embeddings.
arXiv Detail & Related papers (2020-04-03T14:41:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.