Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation
- URL: http://arxiv.org/abs/2504.01542v1
- Date: Wed, 02 Apr 2025 09:30:24 GMT
- Title: Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation
- Authors: Amanda Myntti, Erik Henriksson, Veronika Laippala, Sampo Pyysalo,
- Abstract summary: We show that the register of pretraining data substantially affects model performance.<n>We uncover surprising relationships between the pretraining material and the resulting models.<n>We conclude that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.
- Score: 4.008456970593357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretraining data curation is a cornerstone in Large Language Model (LLM) development, leading to growing research on quality filtering of large web corpora. From statistical quality flags to LLM-based labeling systems, datasets are divided into categories, frequently reducing to a binary: those passing the filters deemed as valuable examples, others discarded as useless or detrimental. However, a more detailed understanding of the contribution of different kinds of texts to model performance is still largely lacking. In this article, we present the first study utilizing registers (also known as genres) - a widely used standard in corpus linguistics to model linguistic variation - to curate pretraining datasets and investigate the effect of register on the performance of LLMs. We perform comparative studies by training models with register classified data and evaluating them using standard benchmarks, and show that the register of pretraining data substantially affects model performance. We uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, and on the contrary, including the Opinion class, covering texts such as reviews and opinion blogs, is highly beneficial. While a model trained on the entire unfiltered dataset outperforms those trained on datasets limited to a single register, combining well-performing registers like How-to-Instructions, Informational Description, and Opinion leads to major improvements. Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data. These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.
Related papers
- Towards Better Understanding Table Instruction Tuning: Decoupling the Effects from Data versus Models [62.47618742274461]
We fine-tune base models from the Mistral, OLMo, and Phi families on existing public training datasets.<n>Our replication achieves performance on par with or surpassing existing table LLMs.<n>We decouple the contributions of training data and the base model, providing insight into their individual impacts.
arXiv Detail & Related papers (2025-01-24T18:50:26Z) - Self-Comparison for Dataset-Level Membership Inference in Large (Vision-)Language Models [73.94175015918059]
We propose a dataset-level membership inference method based on Self-Comparison.
Our method does not require access to ground-truth member data or non-member data in identical distribution.
arXiv Detail & Related papers (2024-10-16T23:05:59Z) - Fine-tuning can Help Detect Pretraining Data from Large Language Models [7.7209640786782385]
Current methods differentiate members and non-members by designing scoring functions, like Perplexity and Min-k%.<n>We introduce a novel and effective method termed Fine-tuned Score Deviation(FSD), which improves the performance of current scoring functions for pretraining data detection.<n>In particular, we propose to measure the deviation distance of current scores after fine-tuning on a small amount of unseen data within the same domain.
arXiv Detail & Related papers (2024-10-09T15:36:42Z) - How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z) - Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.<n>LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.<n>Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z) - Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z) - A Pretrainer's Guide to Training Data: Measuring the Effects of Data
Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z) - How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.