Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check
- URL: http://arxiv.org/abs/2507.00885v1
- Date: Tue, 01 Jul 2025 15:52:55 GMT
- Title: Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check
- Authors: Nicholas Lourie, Michael Y. Hu, Kyunghyun Cho
- Abstract summary: Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. We conduct a meta-analysis of existing data on downstream scaling laws, finding that a close fit to linear scaling laws occurs in only a minority of cases. Seemingly benign changes to the experimental setting can completely change the scaling trend.
- Score: 41.91125949945726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. Whether this prediction should be possible is unclear: some works demonstrate that task performance follows clear linear scaling trends under transformation, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, finding that a close fit to linear scaling laws occurs in only a minority of cases: 39% of the time. Furthermore, seemingly benign changes to the experimental setting can completely change the scaling trend. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To fully model the relationship between pretraining loss and downstream task performance, we must embrace the cases in which scaling behavior deviates from linear trends.
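The sketch below illustrates the kind of fit such a meta-analysis examines: task accuracy measured at several small scales is transformed (here with a logit, one common choice) and regressed linearly on pretraining loss, and the R² of that fit indicates how "close" the trend is to linear. The data points, the transform, and the extrapolation target are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of fitting a linear downstream scaling law, assuming
# task accuracy is roughly linear in pretraining loss after a logit
# transform (one common choice; the paper surveys several settings).
import numpy as np

# Hypothetical small-scale measurements: (pretraining loss, task accuracy).
pretrain_loss = np.array([3.2, 3.0, 2.8, 2.6, 2.4])
task_acc = np.array([0.31, 0.36, 0.42, 0.49, 0.55])

def logit(p):
    return np.log(p / (1.0 - p))

# Fit a line: logit(accuracy) ~= a * loss + b.
a, b = np.polyfit(pretrain_loss, logit(task_acc), deg=1)

# Goodness of fit (R^2) indicates whether the linear trend is "close",
# the property the meta-analysis finds in only ~39% of cases.
pred = a * pretrain_loss + b
resid = logit(task_acc) - pred
r2 = 1.0 - resid.var() / logit(task_acc).var()

# Extrapolate to a hypothetical larger-scale pretraining loss.
loss_at_scale = 2.0
acc_at_scale = 1.0 / (1.0 + np.exp(-(a * loss_at_scale + b)))
print(f"fit R^2 = {r2:.3f}, predicted accuracy at loss {loss_at_scale}: {acc_at_scale:.3f}")
```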
Related papers
- Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches [28.569601803576845]
We show that for models with Transformer architecture, the test loss exhibits a power-law relationship with model size, dataset size, and the amount of computation used in training (a hedged sketch of this power-law form appears after this list). Our analysis provides deeper insights into the scaling law, potentially enhancing our understanding of Large Language Models.
arXiv Detail & Related papers (2025-03-03T08:57:49Z)
- Scaling Laws for Precision [73.24325358259753]
We devise "precision-aware" scaling laws for both training and inference. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
arXiv Detail & Related papers (2024-11-07T00:10:10Z)
- Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
- Scaling Laws for Downstream Task Performance in Machine Translation [27.278023091494507]
We study how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by metrics such as BLEU and COMET scores. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data.
arXiv Detail & Related papers (2024-02-06T17:31:20Z)
- Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic.
We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase".
By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
arXiv Detail & Related papers (2024-02-04T01:55:00Z)
- Predicting Emergent Abilities with Infinite Resolution Evaluation [85.89911520190711]
We introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase.
We predict the performance of a 2.4B model on code generation with merely 0.05% deviation before training starts.
We identify a kind of accelerated emergence whose scaling curve cannot be fitted by standard scaling law function.
arXiv Detail & Related papers (2023-10-05T02:35:00Z)
- Inverse Scaling: When Bigger Isn't Better [80.42834197416444]
Large language models (LMs) show predictable improvements to overall loss with increased scale.
We present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale.
arXiv Detail & Related papers (2023-06-15T20:11:23Z)
- Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
arXiv Detail & Related papers (2022-02-13T19:13:00Z)
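As referenced in the "Scaling Law Phenomena Across Regression Paradigms" entry above, the following is a minimal sketch of fitting the stated power-law relationship for test loss versus model size, assuming the common parameterization L(N) ≈ c · N^(-alpha) + L_inf. The data points, initial guesses, and extrapolation are illustrative assumptions, not results from that paper.

```python
# Minimal sketch of a power-law scaling fit: test loss as a function of
# model size, L(N) ~= c * N^(-alpha) + L_inf. All numbers are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, c, alpha, l_inf):
    # Irreducible loss l_inf plus a term decaying as a power of model size n.
    return c * n ** (-alpha) + l_inf

# Hypothetical (model size, test loss) pairs from small-scale runs.
model_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
test_loss = np.array([4.1, 3.7, 3.3, 3.0, 2.8])

params, _ = curve_fit(power_law, model_sizes, test_loss,
                      p0=[10.0, 0.1, 2.0], maxfev=10000)
c, alpha, l_inf = params
print(f"c={c:.2f}, alpha={alpha:.3f}, irreducible loss={l_inf:.2f}")

# Extrapolate to a larger model size (again, purely illustrative).
print("predicted loss at 1e10 params:", power_law(1e10, *params))
```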