EuroLLM-22B: Technical Report
- URL: http://arxiv.org/abs/2602.05879v1
- Date: Thu, 05 Feb 2026 16:53:47 GMT
- Authors: Miguel Moura Ramos, Duarte M. Alves, Hippolyte Gisserot-Boukhlef, João Alves, Pedro Henrique Martins, Patrick Fernandes, José Pombal, Nuno M. Guerreiro, Ricardo Rei, Nicolas Boizard, Amin Farajian, Mateusz Klimaszewski, José G. C. de Souza, Barry Haddow, François Yvon, Pierre Colombo, Alexandra Birch, André F. T. Martins
- Abstract summary: EuroLLM-22B is a large language model trained from scratch to support the needs of European citizens. It covers all 24 official European Union languages and 11 additional languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.
Related papers
- Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters [53.59868121093848]
We introduce Seed-X, a family of open-source large language models (LLMs) with 7B parameters. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages. The instruct model is then fine-tuned for translation with Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs.
arXiv Detail & Related papers (2025-07-18T03:19:43Z) - EuroLLM-9B: Technical Report [79.96096140260924]
EuroLLM-9B is a large language model trained from scratch to cover all 24 official European Union languages and 11 additional languages. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures.
arXiv Detail & Related papers (2025-06-04T15:43:31Z) - Towards Multilingual LLM Evaluation for European Languages [3.3917876450975317]
We introduce a multilingual evaluation approach tailored for European languages.
We employ translated versions of five widely-used benchmarks to assess the capabilities of 40 LLMs across 21 European languages.
arXiv Detail & Related papers (2024-10-11T15:53:24Z) - Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs [29.881727079038857]
We present two multilingual LLMs, Teuken-7B-Base and Teuken-7B-Instruct. Our models embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union.
arXiv Detail & Related papers (2024-09-30T16:05:38Z) - EuroLLM: Multilingual Language Models for Europe [76.89545643715368]
We introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs.
We outline the progress made to date, detailing our data collection and filtering process.
We report our performance on multilingual general benchmarks and machine translation.
arXiv Detail & Related papers (2024-09-24T16:51:36Z) - Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z) - Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z) - PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors [0.3007949058551534]
We propose a unified framework for EuroVoc classification on 22 languages by fine-tuning modern Transformer-based pretrained language models.
The code and the fine-tuned models are open-sourced, together with a programmatic interface that eases loading the weights of a trained model and classifying a new document.
arXiv Detail & Related papers (2021-08-02T19:46:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.