Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining
- URL: http://arxiv.org/abs/2512.12770v1
- Date: Sun, 14 Dec 2025 17:19:32 GMT
- Title: Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining
- Authors: Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini
- Abstract summary: Continued pretraining extends a language model's capabilities by exposing it to additional data, often tailored to a specific linguistic or domain context. We introduce Curió-Edu 7B, a variant trained exclusively on the educational and STEM-filtered subset of the same corpus, totaling just 10 billion tokens. Despite using only 10% of the data and 20% of the computation, Curió-Edu 7B surpasses the full-corpus model in our evaluations.
- Score: 12.34636448485891
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continued pretraining extends a language model's capabilities by exposing it to additional data, often tailored to a specific linguistic or domain context. This strategy has emerged as an efficient alternative to full retraining when adapting general-purpose models to new settings. In this work, we investigate this paradigm through Curió 7B, a 7-billion-parameter model derived from LLaMA-2 and trained on 100 billion Portuguese tokens from the ClassiCC-PT corpus - the most extensive Portuguese-specific continued-pretraining effort above the three-billion-parameter scale to date. Beyond scale, we investigate whether quantity alone suffices or whether data quality plays a decisive role in linguistic adaptation. To this end, we introduce Curió-Edu 7B, a variant trained exclusively on the educational and STEM-filtered subset of the same corpus, totaling just 10 billion tokens. Despite using only 10% of the data and 20% of the computation, Curió-Edu 7B surpasses the full-corpus model in our evaluations, demonstrating that data selection can be fundamental even when adapting models with limited prior exposure to the target language. The developed models are available at https://huggingface.co/collections/ClassiCC-Corpus/curio-edu
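The core claim is that classifier-driven data selection, not sheer volume, drives the gains: only the educational/STEM slice of ClassiCC-PT is kept. A minimal sketch of such a score-threshold filter follows; the classifier scores, field names, and threshold are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of quality-based corpus filtering. Assumes each document
# already carries an educational-quality score from some upstream classifier
# (FineWeb-Edu-style); field names and threshold are illustrative.

def filter_educational(documents, threshold=0.9):
    """Keep only documents whose quality score clears the threshold."""
    return [doc for doc in documents if doc["edu_score"] >= threshold]

corpus = [
    {"text": "A fotossíntese converte energia luminosa...", "edu_score": 0.97},
    {"text": "Clique aqui para ofertas imperdíveis!!!", "edu_score": 0.12},
]
subset = filter_educational(corpus)
print(len(subset))  # 1 -- in the paper's setup, ~10% of tokens survive
```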
Related papers
- XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs [41.71907186207218]
Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. We propose to optimize the language distribution by training a small proxy model within a domain-reweighting DoGE algorithm. We then rescale the data and train a full-size model with the established language weights, either from scratch or within a continual pre-training phase.
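A hedged sketch of what applying learned language weights could look like at the sampling stage; the weights and the per-batch sampling scheme are illustrative stand-ins, not the DoGE algorithm itself, which learns the weights via a proxy model.

```python
# Illustrative resampling of a multilingual corpus under per-language weights,
# e.g. weights produced by a DoGE-style reweighting run (values are made up).
import random

def sample_batch(docs_by_lang, lang_weights, batch_size=4, seed=0):
    """Draw documents with probability proportional to language weights."""
    rng = random.Random(seed)
    langs = list(lang_weights)
    weights = [lang_weights[lang] for lang in langs]
    return [
        rng.choice(docs_by_lang[rng.choices(langs, weights=weights, k=1)[0]])
        for _ in range(batch_size)
    ]

docs = {"en": ["an English document"], "pt": ["um documento em português"]}
print(sample_batch(docs, {"en": 0.4, "pt": 0.6}))
```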
arXiv Detail & Related papers (2025-12-11T11:22:53Z) - Apertus: Democratizing Open and Compliant LLMs for Global Language Environments [163.70368742538187]
Apertus is a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem. Apertus models are pretrained exclusively on openly available data, respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with 40% of pretraining data allocated to non-English content.
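The robots.txt-respecting ingestion described above can be approximated with the Python standard library alone; this sketch checks a single URL, whereas a crawl-scale pipeline would cache parsed rules per host (the caching concern is our assumption, not a detail from the abstract).

```python
# Sketch of a robots.txt-respecting URL filter using only the standard library.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Return True if robots.txt for the URL's host permits fetching it."""
    parsed = urlparse(url)
    rp = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()  # network fetch; a real pipeline would cache this per host
    return rp.can_fetch(user_agent, url)
```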
arXiv Detail & Related papers (2025-09-17T17:59:21Z) - Pretraining Language Models to Ponder in Continuous Space [50.52734567589996]
We introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations.
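A toy sketch of the pondering idea: running the forward computation several times within one token step and feeding a continuous state back in, rather than emitting a token each pass. The linear "core" and the feedback rule are simplifying assumptions, not the paper's architecture.

```python
# Toy pondering step: repeat the forward pass, feeding back a continuous state.
import torch
import torch.nn as nn

class PonderingStep(nn.Module):
    def __init__(self, d_model: int = 64, n_ponder: int = 3):
        super().__init__()
        self.core = nn.Linear(d_model, d_model)  # stand-in for a transformer block
        self.n_ponder = n_ponder

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_ponder):    # several forward passes, one token step
            h = torch.tanh(self.core(h))  # refined continuous state, no token emitted
        return h

print(PonderingStep()(torch.randn(1, 64)).shape)  # torch.Size([1, 64])
```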
arXiv Detail & Related papers (2025-05-27T03:47:33Z) - Enhancing Multilingual LLM Pretraining with Model-Based Data Selection [33.68104398807581]
We propose a model-based filtering framework for multilingual datasets. Our approach emphasizes transparency, simplicity, and efficiency. We extend our framework to 20 languages, for which we release the refined pretraining datasets.
arXiv Detail & Related papers (2025-02-14T18:42:07Z) - DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models. We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
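Two of the curation operations named above, deduplication and filtering, are easy to sketch; note that DCLM-scale pipelines use fuzzy (e.g. MinHash-based) deduplication rather than the exact hashing shown here, and the length threshold is a made-up example.

```python
# Toy versions of two curation ops: exact-hash dedup and a length filter.
import hashlib

def dedup_exact(docs):
    """Drop byte-identical duplicates via SHA-256 hashes."""
    seen, kept = set(), []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

def length_filter(docs, min_chars=200):
    """Drop very short documents, a common crude quality heuristic."""
    return [t for t in docs if len(t) >= min_chars]
```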
arXiv Detail & Related papers (2024-06-17T17:42:57Z) - Tokenizer Choice For LLM Training: Negligible or Crucial? [30.33170936148845]
We study the influence of tokenizer choice on Large Language Models (LLMs) downstream performance by training 24 mono- and multilingual LLMs.
We find that the tokenizer choice can significantly impact the model's downstream performance and training costs.
We show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary roughly three times larger than comparable English-only tokenizers.
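The vocabulary-size finding is usually quantified through tokenizer "fertility" (average tokens per word); the sketch below computes that statistic with a deliberately toy tokenizer, since whitespace word-splitting and the bigram tokenizer are simplifying assumptions.

```python
# Fertility = tokens produced per whitespace word; lower is generally better.
def fertility(tokenize, texts):
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Toy tokenizer: split text into character bigrams.
char_bigrams = lambda t: [t[i:i + 2] for i in range(0, len(t), 2)]
print(fertility(char_bigrams, ["ein mehrsprachiges Beispiel"]))  # ~4.7
```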
arXiv Detail & Related papers (2023-10-12T22:44:19Z) - FootGPT : A Large Language Model Development Experiment on a Minimal Setting [0.0]
We train a one-billion-parameter general-purpose causal language model on a dataset curated from team statistics of the first ten game weeks of the Italian football league.
We share key observations from developing a special-purpose language model intended to interpret soccer data under constrained resources.
arXiv Detail & Related papers (2023-08-16T18:03:22Z) - Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
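Measuring the contamination described above reduces to running language identification over a corpus sample; `detect_language` below is a placeholder for a real identifier (e.g. a fastText LID model), and the ASCII heuristic is only a toy stand-in.

```python
# Fraction of documents flagged as non-English by a language-ID function.
def non_english_fraction(docs, detect_language):
    flags = [detect_language(text) != "en" for text in docs]
    return sum(flags) / max(len(flags), 1)

# Toy stand-in for a real LID model: non-ASCII text counts as non-English.
toy_lid = lambda text: "en" if text.isascii() else "xx"
print(non_english_fraction(["hello world", "olá mundo"], toy_lid))  # 0.5
```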
arXiv Detail & Related papers (2022-04-17T23:56:54Z) - From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
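A hedged sketch of an answer-aware contrastive objective of the kind the second stage describes: pull the question representation toward the gold answer and away from the other top-k candidates. The InfoNCE-style formulation and random stand-in encodings are our assumptions, not necessarily the paper's exact loss.

```python
# InfoNCE-style contrast: the gold candidate at index 0 should score highest.
import torch
import torch.nn.functional as F

def answer_contrastive_loss(query, gold, negatives, tau=0.1):
    candidates = torch.cat([gold.unsqueeze(0), negatives], dim=0)
    sims = F.cosine_similarity(query.unsqueeze(0), candidates) / tau
    target = torch.zeros(1, dtype=torch.long)  # gold sits at index 0
    return F.cross_entropy(sims.unsqueeze(0), target)

q, gold, negs = torch.randn(16), torch.randn(16), torch.randn(4, 16)
print(answer_contrastive_loss(q, gold, negs).item())
```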
arXiv Detail & Related papers (2021-12-09T07:31:15Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves improvements of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 points over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)