A Family of Pretrained Transformer Language Models for Russian
- URL: http://arxiv.org/abs/2309.10931v4
- Date: Fri, 2 Aug 2024 14:27:46 GMT
- Title: A Family of Pretrained Transformer Language Models for Russian
- Authors: Dmitry Zmitrovich, Alexander Abramov, Andrey Kalmykov, Maria Tikhonova, Ekaterina Taktasheva, Danil Astafurov, Mark Baushenko, Artem Snegirev, Vitalii Kadulin, Sergey Markov, Tatiana Shavrina, Vladislav Mikhailov, Alena Fenogenova
- Abstract summary: This paper introduces a collection of 13 Russian Transformer LMs, which spans encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) architectures.
We provide a report on the model architecture design and pretraining, and the results of evaluating their generalization abilities on Russian language understanding and generation datasets and benchmarks.
- Score: 31.1608981359276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer language models (LMs) are fundamental to NLP research methodologies and applications in various languages. However, developing such models specifically for the Russian language has received little attention. This paper introduces a collection of 13 Russian Transformer LMs, which spans encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) architectures. We provide a report on the model architecture design and pretraining, and the results of evaluating their generalization abilities on Russian language understanding and generation datasets and benchmarks. By pretraining and releasing these specialized Transformer LMs, we aim to broaden the scope of the NLP research directions and enable the development of industrial solutions for the Russian language.
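As a usage illustration (not part of the paper itself), the released checkpoints can typically be loaded with the Hugging Face transformers library. This is a minimal sketch: the Hub ID below for the ruGPT-3 decoder is an assumption and may differ from the actual published name.

```python
# Minimal sketch: loading one of the released Russian checkpoints with
# Hugging Face transformers. The Hub ID below is an assumption about the
# published name of the ruGPT-3 decoder; substitute the actual checkpoint.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ai-forever/rugpt3large_based_on_gpt2"  # assumed Hub name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Encode a short Russian prompt and generate a continuation.
inputs = tokenizer("Москва - столица", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Encoder models such as ruBERT or ruELECTRA would instead be loaded with AutoModel or AutoModelForMaskedLM and fine-tuned on downstream tasks, while ruT5 and FRED-T5 use the seq2seq classes.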
Related papers
- GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture [24.006981776597147]
This paper introduces the GigaChat family of Russian large language models (LLMs).
We provide a detailed report on the model architecture, pre-training process, and experiments to guide design choices.
The paper presents a system demonstration of the top-performing models, accessible via an API, a Telegram bot, and a Web interface.
arXiv Detail & Related papers (2025-06-11T06:46:49Z) - Building Russian Benchmark for Evaluation of Information Retrieval Models [0.0]
RusBEIR is a benchmark for evaluation of information retrieval models in the Russian language.
It integrates adapted, translated, and newly created datasets, enabling comparison of lexical and neural models.
arXiv Detail & Related papers (2025-04-17T12:11:14Z) - EuroBERT: Scaling Multilingual Encoders for European Languages [34.85152487560587]
General-purpose multilingual vector representations are traditionally obtained from bidirectional encoder models.
We introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages.
arXiv Detail & Related papers (2025-03-07T15:13:58Z) - Vikhr: Constructing a State-of-the-art Bilingual Open-Source Instruction-Following Large Language Model for Russian [44.13635168077528]
Vikhr is a state-of-the-art bilingual open-source instruction-following LLM designed specifically for the Russian language.
"Vikhr" refers to the name of the Mistral LLM series and means a "strong gust of wind"
arXiv Detail & Related papers (2024-05-22T18:58:58Z) - A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained encoder-decoder model for offensive language identification, built on text-to-text transformers (T5).
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, on multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
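As a rough illustration of this text-to-text framing, a seq2seq checkpoint can be prompted to emit a label word instead of a class index. The sketch below is not the paper's released model: the generic mT5 checkpoint, prompt, and label strings are assumptions.

```python
# Sketch of offensive-language identification cast as text-to-text generation.
# "google/mt5-small" and the prompt/label format are illustrative only;
# the paper fine-tunes its own T5/mT5 models with their own setup.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Пример комментария для классификации"  # "Example comment to classify"
inputs = tokenizer(f"classify offensive: {text}", return_tensors="pt")
# After fine-tuning, the model is trained to generate a label string
# such as "offensive" or "not offensive" as its output sequence.
outputs = model.generate(**inputs, max_new_tokens=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```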
arXiv Detail & Related papers (2023-12-06T09:37:27Z) - Summarize and Generate to Back-translate: Unsupervised Translation of
Programming Languages [86.08359401867577]
Back-translation is widely known for its effectiveness for neural machine translation when little to no parallel data is available.
We propose performing back-translation via code summarization and generation.
We show that our proposed approach performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2022-05-23T08:20:41Z) - Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP
models [53.95094814056337]
This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models.
The new version includes a number of technical, user experience and methodological improvements.
We also integrate Russian SuperGLUE with MOROCCO, a framework for industrial evaluation of open-source models.
arXiv Detail & Related papers (2022-02-15T23:45:30Z) - DeltaLM: Encoder-Decoder Pre-training for Language Generation and
Translation by Augmenting Pretrained Multilingual Encoders [92.90543340071007]
We introduce DeltaLM, a pretrained multilingual encoder-decoder model.
Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way.
Experiments show that DeltaLM outperforms various strong baselines on both natural language generation and translation tasks.
arXiv Detail & Related papers (2021-06-25T16:12:10Z) - IndT5: A Text-to-Text Transformer for 10 Indigenous Languages [7.952582509792971]
We introduce IndT5, the first Transformer language model for Indigenous languages.
We build IndCorpus, a new dataset for ten Indigenous languages and Spanish.
We present the application of IndT5 to machine translation by investigating different approaches to translate between Spanish and the Indigenous languages.
arXiv Detail & Related papers (2021-04-04T07:09:09Z) - RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark [5.258267224004844]
We introduce an advanced Russian general language understanding evaluation benchmark, RussianGLUE.
For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language.
arXiv Detail & Related papers (2020-10-29T20:31:39Z) - mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z) - Pre-training Polish Transformer-based Language Models at Scale [1.0312968200748118]
We present two language models for Polish based on the popular BERT architecture.
We describe our methodology for collecting the data, preparing the corpus, and pre-training the model.
We then evaluate our models on thirteen Polish linguistic tasks, and demonstrate improvements in eleven of them.
arXiv Detail & Related papers (2020-06-07T18:48:58Z) - Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)