Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check
- URL: http://arxiv.org/abs/2507.00885v1
- Date: Tue, 01 Jul 2025 15:52:55 GMT
- Title: Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check
- Authors: Nicholas Lourie, Michael Y. Hu, Kyunghyun Cho
- Abstract summary: Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. We conduct a meta-analysis of existing data on downstream scaling laws, finding that close fit to linear scaling laws only occurs in a minority of cases. Seemingly benign changes to the experimental setting can completely change the scaling trend.
- Score: 41.91125949945726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. Whether this prediction should be possible is unclear: some works demonstrate that task performance follows clear linear scaling trends under transformation, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, finding that close fit to linear scaling laws only occurs in a minority of cases: 39% of the time. Furthermore, seemingly benign changes to the experimental setting can completely change the scaling trend. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To fully model the relationship between pretraining loss and downstream task performance, we must embrace the cases in which scaling behavior deviates from linear trends.
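The kind of linear trend "under transformation" that the abstract describes can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' method: the logit transform of accuracy, the ordinary least-squares fit, the helper names, and the data points (which are invented, not taken from the paper).

```python
import math


def logit(p):
    """Map an accuracy in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of logit: map a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: how closely the points follow the line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical small-scale runs: pretraining loss and downstream accuracy.
losses = [3.2, 3.0, 2.8, 2.6]
accuracies = [0.30, 0.36, 0.43, 0.50]

# Fit a line in the transformed space (loss vs. logit of accuracy).
transformed = [logit(a) for a in accuracies]
slope, intercept = fit_line(losses, transformed)
fit_quality = r_squared(losses, transformed, slope, intercept)

# Extrapolate to a lower loss, as a larger model might reach.
predicted_accuracy = sigmoid(slope * 2.4 + intercept)
```

A high `fit_quality` on the small-scale points does not by itself guarantee that the extrapolation holds. The abstract's point is that such close linear fits occur only in a minority of cases, and that emergence or inverse scaling can break the trend.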
Related papers
- Scaling Laws for Reranking in Information Retrieval [24.00475965133032]
We present the first systematic study of scaling laws for rerankers.
Using a detailed case study with cross-encoder rerankers, we demonstrate that performance follows a predictable power law.
Our results establish scaling principles for reranking and provide actionable insights for building industrial-grade retrieval systems.
arXiv Detail & Related papers (2026-03-05T05:03:07Z)
- Neural Neural Scaling Laws [40.38002195911611]
We propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation.
NeuNeu achieves 2.04% mean absolute error in predicting model accuracy on 66 downstream tasks.
Our work suggests that predicting downstream scaling laws directly from data outperforms parametric alternatives.
arXiv Detail & Related papers (2026-01-27T17:38:11Z)
- On the Entropy Calibration of Language Models [52.47557449370603]
We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text.
We find that the observed scaling behavior is similar to what is predicted by the simplified setting.
We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
arXiv Detail & Related papers (2025-11-15T00:33:03Z)
- Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time [73.22651918134808]
This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in language models (LMs).
We pretrain LMs from scratch on a synthetic implicit multi-hop reasoning environment designed to replicate the structure and distribution of real-world large-scale knowledge graphs.
We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining.
arXiv Detail & Related papers (2025-04-04T17:57:22Z)
- Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches [28.569601803576845]
We show that for models with Transformer architecture, the test loss exhibits a power-law relationship with model size, dataset size, and the amount of computation used in training.
Our analysis provides deeper insights into the scaling law, potentially enhancing our understanding of Large Language Models.
arXiv Detail & Related papers (2025-03-03T08:57:49Z)
- Scaling Laws for Precision [73.24325358259753]
We devise "precision-aware" scaling laws for both training and inference.
For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data.
For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
arXiv Detail & Related papers (2024-11-07T00:10:10Z)
- Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
- Scaling Laws for Downstream Task Performance in Machine Translation [27.278023091494507]
We study how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by metrics such as BLEU and COMET scores.
With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data.
arXiv Detail & Related papers (2024-02-06T17:31:20Z)
- Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic.
We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase".
By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
arXiv Detail & Related papers (2024-02-04T01:55:00Z)
- Predicting Emergent Abilities with Infinite Resolution Evaluation [85.89911520190711]
We introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase.
We predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts.
We identify a kind of accelerated emergence whose scaling curve cannot be fitted by a standard scaling law function.
arXiv Detail & Related papers (2023-10-05T02:35:00Z)
- Inverse Scaling: When Bigger Isn't Better [80.42834197416444]
Large language models (LMs) show predictable improvements to overall loss with increased scale.
We present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale.
arXiv Detail & Related papers (2023-06-15T20:11:23Z)
- Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
arXiv Detail & Related papers (2022-02-13T19:13:00Z)