LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
- URL: http://arxiv.org/abs/2502.12120v1
- Date: Mon, 17 Feb 2025 18:45:25 GMT
- Title: LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
- Authors: Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
- Abstract summary: Loss-to-loss scaling laws relate losses across pretraining datasets and downstream tasks.
Our experiments reveal that the pretraining data and tokenizer determine the scaling trend.
- Score: 21.053622641336744
- Abstract: Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
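As a concrete illustration of what a loss-to-loss relation looks like in practice, the sketch below fits a shifted power law between pretraining loss and downstream loss. The functional form, the helper name loss_to_loss, and the numbers are illustrative assumptions rather than the paper's exact parameterization; the point is only that such a curve is fit per pretraining dataset and tokenizer, while sweeps over model size or architecture should land on roughly the same curve.

```python
# Minimal sketch of fitting a loss-to-loss trend, assuming a shifted power-law
# form; the data points and constants below are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def loss_to_loss(pretrain_loss, k, kappa, e_down):
    # Downstream loss modeled as a shifted power law of pretraining loss.
    return k * pretrain_loss ** kappa + e_down

# Hypothetical (pretraining loss, downstream loss) pairs, e.g. from a sweep of
# model sizes all trained on the same pretraining dataset and tokenizer.
pretrain_losses = np.array([3.20, 2.90, 2.70, 2.50, 2.35, 2.20])
downstream_losses = np.array([4.10, 3.70, 3.45, 3.25, 3.10, 2.95])

(k, kappa, e_down), _ = curve_fit(
    loss_to_loss, pretrain_losses, downstream_losses, p0=[1.0, 1.0, 1.0]
)
print(f"downstream ~ {k:.2f} * pretrain^{kappa:.2f} + {e_down:.2f}")

# Per the paper's finding, refitting after changing the pretraining dataset or
# tokenizer should shift these coefficients, whereas changing model size,
# optimizer settings, or architecture (Llama vs. Mamba) largely should not.
```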
Related papers
- Scaling Laws for Differentially Private Language Models [53.14592585413073]
Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale.
LLMs rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data.
Training models on this sensitive user data requires careful privacy protections like differential privacy (DP).
arXiv Detail & Related papers (2025-01-31T06:32:46Z)
- Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families [43.36524246307057]
Scaling laws for large language models (LLMs) predict performance based on parameters like size and training data.
We propose Skills Scaling Laws (SSLaws), a novel scaling law that leverages publicly available benchmark data.
We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks.
arXiv Detail & Related papers (2024-12-09T14:51:26Z)
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
- Performance Law of Large Language Models [58.32539851241063]
Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.
arXiv Detail & Related papers (2024-08-19T11:09:12Z)
- Scaling Laws for Downstream Task Performance of Large Language Models [28.904224842085064]
We study how the choice of the pretraining data affects downstream performance (translation quality) as judged by two metrics: downstream cross-entropy and BLEU score.
With sufficient alignment, both downstream cross-entropy and BLEU score improve monotonically with more pretraining data.
arXiv Detail & Related papers (2024-02-06T17:31:20Z)
- Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play a significant role.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
arXiv Detail & Related papers (2022-02-04T06:53:49Z)
- Scaling Laws for Neural Language Models [14.472857826717613]
We study scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power law with model size, dataset size, and the amount of compute used for training (the standard functional forms are sketched after this list).
arXiv Detail & Related papers (2020-01-23T03:59:20Z)
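For context on the last entry above, the per-factor power laws it refers to are usually written as follows. This is the standard notation from that line of work, reproduced here from memory rather than quoted from the listing; the fitted constants are dataset- and tokenizer-dependent.

```latex
% Cross-entropy loss as a power law in parameters N, dataset size D, and compute C,
% with each factor varied while the others are not the bottleneck.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```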
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including any of the listed details) and is not responsible for any consequences of its use.