SMOL: Professionally translated parallel data for 115 under-represented languages
- URL: http://arxiv.org/abs/2502.12301v2
- Date: Fri, 31 Oct 2025 10:59:09 GMT
- Title: SMOL: Professionally translated parallel data for 115 under-represented languages
- Authors: Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry, Geza Kovacs, Hadar Shemtov, Partha Talukdar, Dinesh Tewari, Baba Mamadi Diane, Djibrila Diane, Solo Farabado Cissé, Koulako Moussa Doumbouya, Edoardo Ferrante, Alessandro Guasoni, Christopher Homan, Mamadou K. Keita, Sudhamoy DebBarma, Ali Kuzhuget, David Anugraha, Muhammad Ravi Shulthan Habibi, Genta Indra Winata, Anthony Munthali, Sina Ahmadi, Andrei Chemyshev, Mingfei Lau, Jonathan Eng,
- Abstract summary: We open-source SMOL, a suite of training data to unlock machine translation for low-resource languages. SMOL has been translated into 124 (and growing) under-resourced languages (125 language pairs), including many for which there exist no previous public resources. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOLSENT, a set of sentences chosen for broad unique token coverage, and SMOLDOC, a document-level resource focusing on broad topic coverage.
- Score: 47.9386408192047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock machine translation for low-resource languages. SMOL has been translated into 124 (and growing) under-resourced languages (125 language pairs), including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOLSENT, a set of sentences chosen for broad unique token coverage, and SMOLDOC, a document-level resource focusing on broad topic coverage. They join the already released GATITOS for a trifecta of paragraph, sentence, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust chrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOLDOC, yielding the first factuality datasets for most of these languages.
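The reported chrF gains can be checked with standard tooling. As a minimal sketch (the sentences below are placeholders, not actual SMOL data, and the setup assumes sacrebleu is installed), scoring a system's translations against professional references looks like this:

```python
# Minimal sketch: corpus-level chrF with sacrebleu, the metric family the
# SMOL paper reports. All sentences here are placeholders, not SMOL data.
from sacrebleu.metrics import CHRF

# Hypothetical model outputs for two source sentences.
hypotheses = [
    "Output of a model prompted or fine-tuned with SMOL.",
    "A second system translation.",
]
# Professional reference translations for the same sentences.
references = [
    "Reference translation for the first sentence.",
    "Reference translation for the second sentence.",
]

chrf = CHRF()  # character n-gram F-score; word_order=2 would give chrF++
print(chrf.corpus_score(hypotheses, [references]))  # prints "chrF2 = ..."
```

In practice the hypotheses would come from a model prompted or fine-tuned on SMOLSENT/SMOLDOC pairs, and the references from the professionally translated targets.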
Related papers
- BYOL: Bring Your Own Language Into LLMs [12.151176703151428]
Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain constrained by the severe imbalance in global language resources. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. We introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint.
arXiv Detail & Related papers (2026-01-15T19:15:13Z)
- HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models [25.953042884928006]
We present an initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. We train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models.
arXiv Detail & Related papers (2025-11-02T20:16:38Z)
- Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters [53.59868121093848]
We introduce Seed-X, a family of open-source large language models (LLMs) with 7B parameters. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages. The instruct model is then fine-tuned to translate via Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs.
arXiv Detail & Related papers (2025-07-18T03:19:43Z)
- Multilingual Language Model Pretraining using Machine-translated Data [33.373858866989536]
We translate FineWeb-Edu, a high-quality English web dataset, into nine languages.
We show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data.
arXiv Detail & Related papers (2025-02-18T19:27:53Z)
- LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages [36.52198103816494]
Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks.
But their performance in low-resource languages is hindered by insufficient multilingual data during pre-training.
We conduct extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages.
arXiv Detail & Related papers (2024-07-08T14:18:28Z)
- mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus [52.83121058429025]
We introduce mOSCAR, the first large-scale multilingual and multimodal document corpus crawled from the web.
It covers 163 languages, 315M documents, 214B tokens and 1.2B images.
It shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks.
arXiv Detail & Related papers (2024-06-13T00:13:32Z)
- A New Massive Multilingual Dataset for High-Performance Language Technologies [14.375854322321997]
The HPLT language resources are a new massive multilingual dataset including both monolingual and bilingual corpora.
Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of 5.6 trillion word tokens de-duplicated on the document level.
Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens.
arXiv Detail & Related papers (2024-03-20T22:14:39Z)
- Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model [33.87586041774359]
Aya is a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced.
We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages.
We conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models.
arXiv Detail & Related papers (2024-02-12T17:34:13Z)
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
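As an illustrative sketch only (the numbers and names below are assumptions for the toy, not PolyLM's actual data pipeline), a curriculum that raises the non-English share of the training mixture from 30% to 60% can be expressed as a simple interpolation over training progress:

```python
# Toy curriculum sketch: grow the non-English share of the sampled training
# mixture from 30% to 60% over pre-training. Assumption-laden illustration,
# not PolyLM's actual implementation.
import random

def non_english_share(step: int, total_steps: int,
                      start: float = 0.30, end: float = 0.60) -> float:
    """Linearly interpolate the non-English sampling proportion."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + t * (end - start)

def sample_language(step: int, total_steps: int, non_english: list[str]) -> str:
    """Draw 'en' or a non-English language according to the current share."""
    if random.random() < non_english_share(step, total_steps):
        return random.choice(non_english)
    return "en"

# Example: the non-English share is ~30% early in training and ~60% at the end.
langs = ["zh", "es", "ar", "id"]  # placeholder language set
for step in (0, 50_000, 100_000):
    print(step, round(non_english_share(step, 100_000), 2),
          sample_language(step, 100_000, langs))
```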
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages [102.50127671423752]
We introduce SMaLL-100, a distilled version of the M2M-100 (12B) machine translation model covering 100 languages.
We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages.
Our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference.
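SMaLL-100 keeps the M2M-100 interface, so translating with an M2M-100-style model in Hugging Face transformers looks roughly like the following hedged illustration; it uses the publicly documented facebook/m2m100_418M checkpoint as a stand-in, since the exact SMaLL-100 checkpoint name is not given here.

```python
# Rough illustration of many-to-many translation with an M2M-100-style model
# in Hugging Face transformers. Uses facebook/m2m100_418M as a stand-in; the
# SMaLL-100 weights themselves are not referenced here.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "fr"  # source language code
encoded = tokenizer("La vie est belle.", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("en"),  # target language
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```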
arXiv Detail & Related papers (2022-10-20T22:32:29Z)
- XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [7.8288425529553916]
We present XL-Sum, a comprehensive and diverse dataset of 1 million professionally annotated article-summary pairs from BBC.
The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available.
XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
arXiv Detail & Related papers (2021-06-25T18:00:24Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems from WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery [1.9215779751499527]
The model is an extension of BaySMM [Kesiraju et al. 2020] to the multilingual scenario.
We propagate the learned uncertainties through linear classifiers that benefit zero-shot cross-lingual topic identification.
We revisit cross-lingual topic identification in zero-shot settings by taking a deeper dive into current datasets.
arXiv Detail & Related papers (2020-07-02T19:55:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.