LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
- URL: http://arxiv.org/abs/2502.12120v1
- Date: Mon, 17 Feb 2025 18:45:25 GMT
- Title: LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
- Authors: Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
- Abstract summary: Loss-to-loss scaling laws relate losses across pretraining datasets and downstream tasks. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend.
- Score: 21.053622641336744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
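As a rough illustration of what a loss-to-loss relation looks like in practice, the sketch below fits a simple power law between pretraining loss and downstream loss in log-log space. It is a minimal, hypothetical example: the data points, variable names, and the plain (unshifted) power-law form are illustrative assumptions, not the authors' code or results.

```python
# Minimal sketch of a loss-to-loss fit (illustrative, not the paper's code).
# Assumed form: L_down ~ K * L_pre^kappa, which is linear in log-log space.
import numpy as np

# Hypothetical (pretraining loss, downstream loss) pairs from models of
# different sizes trained on the same pretraining data and tokenizer.
pretrain_loss = np.array([3.2, 2.9, 2.6, 2.4, 2.2])
downstream_loss = np.array([4.1, 3.6, 3.1, 2.8, 2.5])

# Fit log L_down = kappa * log L_pre + log K with a degree-1 polynomial.
kappa, log_k = np.polyfit(np.log(pretrain_loss), np.log(downstream_loss), deg=1)
k = np.exp(log_k)

print(f"fitted trend: L_down ~ {k:.2f} * L_pre^{kappa:.2f}")
```

Under the paper's finding, models sharing pretraining data and tokenizer should fall on the same fitted line (same K and kappa) regardless of architecture, while changing the pretraining dataset shifts the fitted trend.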
Related papers
- Scaling Laws for Data-Efficient Visual Transfer Learning [14.114908296325277]
This paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning.
We propose the distillation boundary theory, revealing a critical turning point in distillation efficiency.
This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation.
arXiv Detail & Related papers (2025-04-17T07:01:01Z) - Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo [22.7130140114906]
We study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget.
We find that DiLoCo scales both predictably and robustly with model size.
When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes.
arXiv Detail & Related papers (2025-03-12T20:04:38Z) - Scaling Laws for Differentially Private Language Models [53.14592585413073]
Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale. LLMs rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP).
arXiv Detail & Related papers (2025-01-31T06:32:46Z) - Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families [43.36524246307057]
Scaling laws for large language models (LLMs) predict performance based on parameters like size and training data.
We propose Skills Scaling Laws (SSLaws), a novel scaling law that leverages publicly available benchmark data.
We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks.
arXiv Detail & Related papers (2024-12-09T14:51:26Z) - P$^2$ Law: Scaling Law for Post-Training After Model Pruning [25.07013858614455]
Pruning has become a widely adopted technique for reducing the hardware requirements of large language models (LLMs). To recover model performance after pruning, post-training is commonly employed to mitigate the resulting performance degradation. To balance post-training cost and model performance, it is necessary to explore the optimal amount of post-training data.
arXiv Detail & Related papers (2024-11-15T15:28:42Z) - Performance Law of Large Language Models [58.32539851241063]
Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.
arXiv Detail & Related papers (2024-08-19T11:09:12Z) - AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs [61.13296177652599]
We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales.
We propose AutoScale, a two-stage, scale-aware data composition framework.
arXiv Detail & Related papers (2024-07-29T17:06:30Z) - Scaling Laws for Downstream Task Performance of Large Language Models [28.904224842085064]
We study how the choice of the pretraining data affects downstream performance (translation quality) as judged by two metrics: downstream cross-entropy and BLEU score.
With sufficient alignment, both downstream cross-entropy and BLEU score improve monotonically with more pretraining data.
arXiv Detail & Related papers (2024-02-06T17:31:20Z) - Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
arXiv Detail & Related papers (2022-02-04T06:53:49Z) - Scaling Laws for Neural Language Models [14.472857826717613]
We study scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training.
arXiv Detail & Related papers (2020-01-23T03:59:20Z)
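For context on the last entry, the power-law forms reported by Kaplan et al. (2020) are commonly written as below; the constants and exponents are empirical fits, and the notation follows common usage rather than quoting the paper verbatim.

```latex
% Power-law scaling of cross-entropy loss (Kaplan et al., 2020);
% N_c, D_c, C_c and the exponents are empirical constants.
\begin{align*}
  L(N) &\approx \left(\tfrac{N_c}{N}\right)^{\alpha_N} && \text{(non-embedding parameters $N$)} \\
  L(D) &\approx \left(\tfrac{D_c}{D}\right)^{\alpha_D} && \text{(dataset size $D$ in tokens)} \\
  L(C) &\approx \left(\tfrac{C_c}{C}\right)^{\alpha_C} && \text{(training compute $C$)}
\end{align*}
```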