Pretraining Finnish ModernBERTs
- URL: http://arxiv.org/abs/2511.09213v1
- Date: Thu, 13 Nov 2025 01:41:04 GMT
- Title: Pretraining Finnish ModernBERTs
- Authors: Akseli Reunamo, Laura-Maria Peltonen, Hans Moen, Sampo Pyysalo,
- Abstract summary: This paper reports on pretraining ModernBERT encoder models in six different sizes, ranging from 51M to 475M parameters, with a focus on limited multilingualism, emphasizing languages relevant to Finland.<n>Our models are competitive with, or superior to, existing multilingual models. They outperform monolingual models on tasks that require a context longer than 512 tokens.
- Score: 3.000523452333836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper reports on pretraining ModernBERT encoder models in six different sizes, ranging from 51M to 475M parameters, with a focus on limited multilingualism, emphasizing languages relevant to Finland. Our models are competitive with, or superior to, existing multilingual models. They outperform monolingual models on tasks that require a context longer than 512 tokens. We present empirical results on using different data in the final stage of training. The code and models are publicly released.
Related papers
- mmBERT: A Modern Multilingual Encoder with Annealed Language Learning [57.58071656545661]
mmBERT is an encoder-only language model pretrained on 3T tokens of multilingual text.<n>We add over 1700 low-resource languages to the data mix only during the decay phase.<n>We show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks.
arXiv Detail & Related papers (2025-09-08T17:08:42Z) - Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining [4.38070902806635]
We set up a benchmark for languages Croatian, Serbian, Bosnian and Montenegrin.
We show that comparable performance to dedicated from-scratch models can be obtained by additionally pretraining available multilingual models.
We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.
arXiv Detail & Related papers (2024-04-08T11:55:44Z) - Poro 34B and the Blessing of Multilinguality [3.270981284471548]
Poro 34B is a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages.<n>We show that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish.
arXiv Detail & Related papers (2024-04-02T11:34:12Z) - Multilingual E5 Text Embeddings: A Technical Report [63.503320030117145]
Three embedding models of different sizes are provided, offering a balance between the inference efficiency and embedding quality.
We introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.
arXiv Detail & Related papers (2024-02-08T13:47:50Z) - FinGPT: Large Generative Models for a Small Language [48.46240937758779]
We create large language models (LLMs) for Finnish, a language spoken by less than 0.1% of the world population.
We train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT.
We continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
arXiv Detail & Related papers (2023-11-03T08:05:04Z) - Training dataset and dictionary sizes matter in BERT models: the case of
Baltic languages [0.0]
We train a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian.
We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy.
arXiv Detail & Related papers (2021-12-20T14:26:40Z) - Breaking Down Multilingual Machine Translation [74.24795388967907]
We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs)
Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
arXiv Detail & Related papers (2021-10-15T14:57:12Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Multilingual Translation with Extensible Multilingual Pretraining and
Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.