Gaperon: A Peppered English-French Generative Language Model Suite
- URL: http://arxiv.org/abs/2510.25771v1
- Date: Wed, 29 Oct 2025 17:59:39 GMT
- Title: Gaperon: A Peppered English-French Generative Language Model Suite
- Authors: Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, Djamé Seddah,
- Abstract summary: Gaperon is a fully open suite of French-English-coding language models. We study how data filtering and contamination interact to shape both benchmark and generative performance.
- Score: 25.492050653893184
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination -- continuing training on data mixes that include test sets -- recovers competitive scores while only moderately harming generation quality. We discuss how commonly used neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.
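A minimal sketch of the score-and-threshold filtering step described in the abstract: a quality classifier scores each document, and only documents above a cut-off are kept for pretraining. The scorer below is a trivial stand-in and the threshold is arbitrary; neither reproduces Gaperon's released classifier or pipeline.

```python
# Sketch: stream a JSONL corpus and keep documents whose quality score
# exceeds a threshold. quality_score is a toy stand-in for a neural classifier.
import json

def quality_score(text: str) -> float:
    """Stand-in scorer: fraction of alphabetic/whitespace characters.

    A real pipeline would call a trained neural classifier here and return
    its predicted probability that the document is linguistically clean.
    """
    if not text:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in text) / len(text)

def filter_corpus(in_path: str, out_path: str, threshold: float = 0.9) -> int:
    """Keep only documents scoring above the threshold; return how many were kept."""
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)                      # one document per line
            if quality_score(doc.get("text", "")) >= threshold:
                fout.write(json.dumps(doc, ensure_ascii=False) + "\n")
                kept += 1
    return kept
```

Streaming the corpus line by line keeps memory use constant, which matters at the multi-trillion-token scale the paper targets.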
Related papers
- Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages [16.671158083515373]
We develop a fluent preference-aligned language model without instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments.
arXiv Detail & Related papers (2025-12-09T16:31:48Z)
- Apertus: Democratizing Open and Compliant LLMs for Global Language Environments [163.70368742538187]
Apertus is a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem. Apertus models are pretrained exclusively on openly available data, respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with 40% of pretraining data allocated to non-English content.
arXiv Detail & Related papers (2025-09-17T17:59:21Z)
- Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets [0.0]
This study compares two monolingual BERT models: one trained and tested entirely on Swahili data, and another trained on comparable English news data. It tests whether translating Swahili inputs for evaluation on an English model performs better or worse than training and testing a model entirely in Swahili. The results show that, despite high-quality translation, the native Swahili-trained model outperformed the Swahili-to-English translated pipeline, producing roughly four times fewer errors (0.36% vs. 1.47%, respectively).
arXiv Detail & Related papers (2025-09-03T03:25:11Z)
- CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens. We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio. To assess performance outside of English, we craft a novel benchmark, FrenchBench.
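As an illustration of the 1:1 ratio idea, the sketch below draws documents from two corpora while keeping their cumulative token counts balanced. The function names and token-counting proxy are assumptions for illustration, not CroissantLLM's actual data pipeline.

```python
from typing import Callable, Iterable, Iterator, Tuple

def balanced_token_mix(
    english: Iterable[str],
    french: Iterable[str],
    count_tokens: Callable[[str], int],
    budget_per_language: int,
) -> Iterator[Tuple[str, str]]:
    """Yield (language, document) pairs so that both languages contribute
    roughly the same number of tokens, approximating a 1:1 mix."""
    counts = {"en": 0, "fr": 0}
    streams = {"en": iter(english), "fr": iter(french)}
    while min(counts.values()) < budget_per_language:
        lang = min(counts, key=counts.get)   # draw from the language that lags
        try:
            doc = next(streams[lang])
        except StopIteration:
            break                            # one corpus ran out early
        counts[lang] += count_tokens(doc)
        yield lang, doc
```

A crude whitespace proxy such as `lambda s: len(s.split())` is enough to try the sketch; a real pipeline would count subword tokens with the model's tokenizer.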
arXiv Detail & Related papers (2024-02-01T17:17:55Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
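One generic way such a consistency model can be plugged in is reranking: generate several candidate summaries and keep the one the consistency model trusts most. The `consistency_score` callable below is a hypothetical stand-in for an entailment-style scorer; this is a sketch of the general idea, not mFACE's specific recipe.

```python
from typing import Callable, List

def rerank_by_consistency(
    source: str,
    candidates: List[str],
    consistency_score: Callable[[str, str], float],
) -> str:
    """Return the candidate summary judged most factually consistent
    with the source document by the scoring model."""
    return max(candidates, key=lambda summary: consistency_score(source, summary))
```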
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Robustification of Multilingual Language Models to Real-world Noise with Robust Contrastive Pretraining [14.087882550564169]
We assess the robustness of neural models on noisy data and find that existing robustness improvements are largely limited to English.
To benchmark the performance of pretrained multilingual models, we construct noisy datasets covering five languages and four NLP tasks.
We propose Robust Contrastive Pretraining (RCP) to boost the zero-shot cross-lingual robustness of multilingual pretrained models.
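A minimal sketch of the contrastive objective underlying this kind of robustness pretraining: each clean sentence's representation is pulled toward the representation of its noised copy and pushed away from the other sentences in the batch (an InfoNCE loss). The toy noise function and the missing encoder are simplifications; this is not RCP's exact formulation.

```python
import random
import torch
import torch.nn.functional as F

def add_char_noise(text: str, p: float = 0.1) -> str:
    """Randomly drop characters to simulate real-world typos (toy noise model)."""
    return "".join(c for c in text if random.random() > p)

def contrastive_loss(clean_emb: torch.Tensor, noisy_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE over a batch: each clean sentence should match its own noisy copy."""
    clean = F.normalize(clean_emb, dim=-1)            # (batch, dim)
    noisy = F.normalize(noisy_emb, dim=-1)            # (batch, dim)
    logits = clean @ noisy.t() / temperature          # similarity of every pair
    targets = torch.arange(clean.size(0), device=clean.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```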
arXiv Detail & Related papers (2022-10-10T15:40:43Z)
- Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human language data yields GLUE performance close to that obtained by pre-training on a non-English human language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
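A compact sketch of the two ingredients this summary mentions: a decomposition of a pretrained representation into domain-invariant and domain-specific parts, and a MINE-style mutual information estimate that can be minimized to push the two parts toward independence. Module names and sizes are illustrative assumptions, not the paper's architecture.

```python
import math
import torch
import torch.nn as nn

class Decomposer(nn.Module):
    """Split a pretrained cross-lingual representation into a domain-invariant
    and a domain-specific part with two linear heads (illustrative only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.invariant = nn.Linear(dim, dim)
        self.specific = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor):
        return self.invariant(h), self.specific(h)

class MICritic(nn.Module):
    """Statistics network T(x, y) used by a MINE mutual information estimator."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def mi_estimate(critic: MICritic, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan bound: E[T(x, y)] - log E[exp(T(x, y_shuffled))].
    The critic is trained to maximize this value (tightening the bound),
    while the decomposer is trained to minimize it alongside its task losses,
    encouraging independence between the invariant and specific parts."""
    joint = critic(x, y).mean()
    y_shuffled = y[torch.randperm(y.size(0))]
    marginal = torch.logsumexp(critic(x, y_shuffled), dim=0) - math.log(y.size(0))
    return joint - marginal
```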
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
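To give a flavour of the decoding idea, the sketch below implements a simplified surface-form block in the spirit of Dynamic Blocking: whenever the most recently generated token also occurs in the source, the token that immediately follows it in the source is forbidden at the next step, so the model must rephrase rather than copy. The published algorithm adds details (e.g. probabilistic blocking) that are omitted here.

```python
import math
from typing import List

def block_copying_logits(
    logits: List[float],
    source_ids: List[int],
    last_generated_id: int,
) -> List[float]:
    """If the last generated token occurs in the source, forbid the token that
    follows it there, forcing the decoder to paraphrase instead of copying."""
    blocked = {
        source_ids[i + 1]
        for i, tok in enumerate(source_ids[:-1])
        if tok == last_generated_id
    }
    return [
        -math.inf if token_id in blocked else score
        for token_id, score in enumerate(logits)
    ]
```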
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields results comparable to or better than state-of-the-art zero-shot cross-lingual transfer methods.
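The sketch below illustrates the factorization idea in its simplest, non-Bayesian form: adapter weights for a (task, language) pair are generated from separate task and language latent vectors, so an unseen pair can be handled by recombining latents learned from seen pairs. The hypernetwork layout and names are assumptions for illustration; the paper instead infers posteriors over these latent variables.

```python
import torch
import torch.nn as nn

class FactorizedAdapterGenerator(nn.Module):
    """Generate adapter weights from separate task and language latent vectors
    (point-estimate sketch of the factorization idea, not the Bayesian model)."""

    def __init__(self, n_tasks: int, n_langs: int, latent_dim: int,
                 in_dim: int, out_dim: int):
        super().__init__()
        self.task_latent = nn.Embedding(n_tasks, latent_dim)
        self.lang_latent = nn.Embedding(n_langs, latent_dim)
        # hypernetwork mapping the combined latent to a weight matrix
        self.to_weights = nn.Linear(2 * latent_dim, in_dim * out_dim)
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, task_id: torch.Tensor, lang_id: torch.Tensor,
                features: torch.Tensor) -> torch.Tensor:
        # combine the two latents, generate weights, and apply them
        z = torch.cat([self.task_latent(task_id), self.lang_latent(lang_id)], dim=-1)
        w = self.to_weights(z).view(self.out_dim, self.in_dim)
        return features @ w.t()
```

For a zero-shot (task, language) combination, the generator simply looks up the task latent learned from other languages and the language latent learned from other tasks.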
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.