Related papers: Vikhr: The Family of Open-Source Instruction-Tuned Large Language Models for Russian

Vikhr: The Family of Open-Source Instruction-Tuned Large Language Models for Russian

URL: http://arxiv.org/abs/2405.13929v2
Date: Wed, 19 Jun 2024 17:32:23 GMT
Title: Vikhr: The Family of Open-Source Instruction-Tuned Large Language Models for Russian
Authors: Aleksandr Nikolich, Konstantin Korolev, Artem Shelmanov, Igor Kiselev,
Abstract summary: Vikhr is a new state-of-the-art open-source instruction-tuned LLM for the Russian language. Vikhhr features an adapted tokenizer vocabulary and undergoes the continued pre-training and instruction tuning of all weights. Vikhhr not only sets the new state of the art among open-source LLMs for Russian, but even outperforms some proprietary closed-source models on certain benchmarks.
Score: 46.76757653630145
License: http://creativecommons.org/licenses/by/4.0/
Abstract: There has been a surge in the development of various Large Language Models (LLMs). However, text generation for languages other than English often faces significant challenges, including poor generation quality and the reduced computational performance due to the disproportionate representation of tokens in model's vocabulary. In this work, we address these issues and introduce Vikhr, a new state-of-the-art open-source instruction-tuned LLM designed specifically for the Russian language. Unlike previous efforts for Russian that utilize computationally inexpensive LoRA adapters on top of English-oriented models, Vikhr features an adapted tokenizer vocabulary and undergoes the continued pre-training and instruction tuning of all weights. This approach not only enhances the model's performance but also significantly improves its computational and contextual efficiency. The remarkable performance of Vikhr across various Russian-language benchmarks can also be attributed to our efforts in expanding instruction datasets and corpora for continued pre-training. Vikhr not only sets the new state of the art among open-source LLMs for Russian, but even outperforms some proprietary closed-source models on certain benchmarks. The model weights, instruction sets, and code are publicly available

Related papers

LLMic: Romanian Foundation Language Model [76.09455151754062]
We present LLMic, a foundation language model designed specifically for the Romanian Language. We show that fine-tuning LLMic for language translation after the initial pretraining phase outperforms existing solutions in English-to-Romanian translation tasks.
arXiv Detail & Related papers (2025-01-13T22:14:45Z)
Exploring Pretraining via Active Forgetting for Improving Cross Lingual Transfer for Decoder Language Models [7.998168689120558]
Large Language Models (LLMs) demonstrate exceptional capabilities in a multitude of NLP tasks. The efficacy of such models to languages other than English is often limited. We show that LLMs pretrained with active forgetting are highly effective when adapting to new and unseen languages.
arXiv Detail & Related papers (2024-10-21T16:33:16Z)
Multilingual Prompts in LLM-Based Recommenders: Performance Across Languages [0.0]
This work explores the impact of non-English prompts on recommendation performance. Evaluation on three real-world datasets, namely ML1M, LastFM, and Amazon-Beauty, showed that usage of non-English prompts generally reduce performance. Retraining with multilingual prompts resulted in more balanced performance across languages, but slightly reduced English performance.
arXiv Detail & Related papers (2024-09-11T20:31:42Z)
MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing [78.62611800987817]
Large Language Models (LLMs) are often English-centric due to the disproportionate distribution of languages in their pre-training data. We propose a method called MoE-LPR (Mixture-of-Experts with Language Priors) to enhance the multilingual capability.
arXiv Detail & Related papers (2024-08-21T07:43:49Z)
Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities [2.047424180164312]
Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges. We introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English.
arXiv Detail & Related papers (2024-07-09T17:51:37Z)
Machine Translation with Large Language Models: Prompt Engineering for Persian, English, and Russian Directions [0.0]
Generative large language models (LLMs) have demonstrated exceptional proficiency in various natural language processing (NLP) tasks. We conducted an investigation into two popular prompting methods and their combination, focusing on cross-language combinations of Persian, English, and Russian.
arXiv Detail & Related papers (2024-01-16T15:16:34Z)
Generate to Understand for Representation [3.5325087487696463]
GUR is a pretraining framework that combines language modeling and contrastive learning objectives in a single training step. GUR achieves impressive results without any labeled training data, outperforming all other pretrained baselines as a retriever at the recall benchmark in a zero-shot setting.
arXiv Detail & Related papers (2023-06-14T06:00:18Z)
Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks. Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training. We propose a textbfMultitextbfLingual textbfAcquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
Cross-lingual Transferring of Pre-trained Contextualized Language Models [73.97131976850424]
We propose a novel cross-lingual model transferring framework for PrLMs: TreLM. To handle the symbol order and sequence length differences between languages, we propose an intermediate TRILayer" structure. We show the proposed framework significantly outperforms language models trained from scratch with limited data in both performance and efficiency.
arXiv Detail & Related papers (2021-07-27T06:51:13Z)
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages. We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-source languages. We propose a novel augmentation approach named Language Branch Machine Reading (LBMRC) LBMRC trains multiple machine reading comprehension (MRC) models proficient in individual language. We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.