Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
- URL: http://arxiv.org/abs/2509.08824v1
- Date: Wed, 10 Sep 2025 17:58:23 GMT
- Title: Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
- Authors: Thales Sales Almeida, Rodrigo Nogueira, Helio Pedrini,
- Abstract summary: We explore scalable methods for building web-based corpora for large language models (LLMs). We build a new 120B token corpus in Portuguese that achieves results competitive with an industrial-grade corpus. We show that adapting a model to the target language leads to performance improvements, reinforcing the importance of high-quality, language-specific data.
- Score: 8.105169210920556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of large language models (LLMs) is deeply influenced by the quality and composition of their training data. While much of the existing work has centered on English, there remains a gap in understanding how to construct effective training corpora for other languages. We explore scalable methods for building web-based corpora for LLMs. We apply them to build a new 120B token corpus in Portuguese that achieves results competitive with an industrial-grade corpus. Using a continual pretraining setup, we study how different data selection and preprocessing strategies affect LLM performance when transitioning a model originally trained in English to another language. Our findings demonstrate the value of language-specific filtering pipelines, including classifiers for education, science, technology, engineering, and mathematics (STEM), as well as toxic content. We show that adapting a model to the target language leads to performance improvements, reinforcing the importance of high-quality, language-specific data. While our case study focuses on Portuguese, our methods are applicable to other languages, offering insights for multilingual LLM development.
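The language-specific filtering pipeline described above can be approximated with off-the-shelf components. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it uses fastText's public language-identification model (lid.176.bin) together with a hypothetical Portuguese quality/STEM classifier (pt_quality.bin) that would have to be trained separately on labeled web pages; the thresholds are placeholders.

```python
# Minimal sketch of a language-specific filtering step for web documents.
# Assumptions: fastText's public language-ID model (lid.176.bin) and a
# hypothetical Portuguese quality classifier (pt_quality.bin) trained elsewhere.
import fasttext

LANG_ID = fasttext.load_model("lid.176.bin")      # language identification
QUALITY = fasttext.load_model("pt_quality.bin")   # hypothetical quality/STEM classifier

def keep_document(text: str,
                  min_lang_prob: float = 0.9,
                  min_quality_prob: float = 0.5) -> bool:
    """Keep documents that are confidently Portuguese and scored as high quality."""
    snippet = " ".join(text.split())[:2000]        # fastText expects single-line input
    labels, probs = LANG_ID.predict(snippet)
    if labels[0] != "__label__pt" or probs[0] < min_lang_prob:
        return False
    q_labels, q_probs = QUALITY.predict(snippet)
    return q_labels[0] == "__label__keep" and q_probs[0] >= min_quality_prob

docs = ["O teorema de Pitágoras relaciona os lados de um triângulo retângulo.",
        "click here to win a prize!!!"]
filtered = [d for d in docs if keep_document(d)]
```

In a Common Crawl setting a predicate like this would typically run after text extraction and deduplication, with the toxic-content filter applied as a separate classifier pass.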
Related papers
- Multilingual VLM Training: Adapting an English-Trained VLM to French [0.0]
This paper explores the challenges of adapting an English-trained VLM to different languages. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance.
arXiv Detail & Related papers (2025-12-11T06:38:51Z)
- Enhancing Multilingual LLM Pretraining with Model-Based Data Selection [33.68104398807581]
We propose a model-based filtering framework for multilingual datasets. Our approach emphasizes transparency, simplicity, and efficiency. We extend our framework to 20 languages for which we release the refined pretraining datasets.
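Model-based selection is often instantiated as scoring documents with a small reference model and keeping the best-scoring fraction. The sketch below shows a perplexity-based variant as one possible instantiation, not necessarily this paper's method; the GPT-2 checkpoint is a placeholder where a small multilingual model would be used in practice.

```python
# Hedged sketch: rank documents by the perplexity of a small causal LM and
# keep the lowest-perplexity fraction. The checkpoint name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")              # placeholder reference model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**enc, labels=enc["input_ids"])                # mean token cross-entropy
    return float(torch.exp(out.loss))

def select(docs: list[str], keep_fraction: float = 0.5) -> list[str]:
    ranked = sorted(docs, key=perplexity)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]
```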
arXiv Detail & Related papers (2025-02-14T18:42:07Z)
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
- The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model [59.357993924917]
We study the evolution of multilingual capabilities in large language models (LLMs) during the pre-training process. We propose the Babel Tower Hypothesis, which describes the entire process of LLMs acquiring new language capabilities. We propose a novel method to construct an optimized pre-training corpus for multilingual code LLMs.
arXiv Detail & Related papers (2024-12-10T08:28:57Z)
- Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models [28.716852515539497]
This study created an extended dataset in multiple languages without relying on machine translation. It examined whether Instruction-Tuning in resource-rich English improves performance in other languages.
arXiv Detail & Related papers (2024-09-03T03:42:56Z)
- Exploring Design Choices for Building Language-Specific LLMs [36.32622880071991]
We study building language-specific language models by adapting monolingual and multilingual models.
We find that the initial performance of the LLM does not always correlate with the final performance after adaptation.
arXiv Detail & Related papers (2024-06-20T18:47:43Z)
- LEIA: Facilitating Cross-lingual Knowledge Transfer in Language Models with Entity-based Data Augmentation [21.980770995466134]
We introduce LEIA, a language adaptation tuning method that utilizes Wikipedia entity names aligned across languages.
This method involves augmenting the target language corpus with English entity names and training the model using left-to-right language modeling.
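As a rough picture of this kind of entity-based augmentation, the sketch below inserts the aligned English entity name next to its mention in the target-language text before ordinary left-to-right language modeling; the alignment dictionary and the <ent> marker format are illustrative assumptions, not the paper's exact specification.

```python
# Illustrative sketch: annotate target-language entity mentions with their
# aligned English names before left-to-right language modeling.
# ENTITY_MAP and the "<ent>" markers are assumptions for illustration only.
ENTITY_MAP = {"Estados Unidos": "United States", "Alemanha": "Germany"}

def augment(text: str) -> str:
    for target_name, english_name in ENTITY_MAP.items():
        if target_name in text:
            text = text.replace(target_name,
                                f"{target_name} <ent>{english_name}</ent>")
    return text

print(augment("Os Estados Unidos fazem fronteira com o Canadá."))
# -> "Os Estados Unidos <ent>United States</ent> fazem fronteira com o Canadá."
```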
arXiv Detail & Related papers (2024-02-18T07:24:34Z)
- Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study [5.350495525141013]
Large language models (LLMs) have achieved state-of-the-art performance in various software engineering tasks. Critical languages, such as Rust and Swift, remain low-resource due to limited openly available code. We develop a performance prediction model to guess the best source languages for a given target and task.
arXiv Detail & Related papers (2023-10-25T19:04:33Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
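For comparison with such multilingual corpora, a language-specific slice can usually be streamed directly. The snippet below assumes the dataset is published on the Hugging Face Hub as uonlp/CulturaX with per-language configurations such as "pt", and access may require accepting the dataset's terms of use.

```python
# Hedged sketch: stream the Portuguese configuration of CulturaX.
# Assumes the Hub dataset id "uonlp/CulturaX" and a "pt" configuration;
# access may require accepting the dataset license on the Hub.
from datasets import load_dataset

culturax_pt = load_dataset("uonlp/CulturaX", "pt", split="train", streaming=True)

for i, example in enumerate(culturax_pt):
    print(example["text"][:200])   # each record carries the raw document text
    if i >= 2:
        break
```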
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Cross-lingual Transferring of Pre-trained Contextualized Language Models [73.97131976850424]
We propose a novel cross-lingual model transferring framework for PrLMs: TreLM.
To handle the symbol order and sequence length differences between languages, we propose an intermediate "TRILayer" structure.
We show the proposed framework significantly outperforms language models trained from scratch with limited data in both performance and efficiency.
arXiv Detail & Related papers (2021-07-27T06:51:13Z)