Related papers: Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

URL: http://arxiv.org/abs/2512.08894v1
Date: Tue, 09 Dec 2025 18:33:48 GMT
Title: Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
Authors: Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram,
Abstract summary: We propose a direct framework to model the scaling of benchmark performance from the training budget.<n>Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure.<n>We release the complete set of pretraining losses and downstream evaluation results.
Score: 11.179110411255708
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.

Related papers

Scaling Laws for Reranking in Information Retrieval [24.00475965133032]
We present the first systematic study of scaling laws for rerankers.<n>Using a detailed case study with cross-encoder rerankers, we demonstrate that performance follows a predictable power law.<n>Our results establish scaling principles for reranking and provide actionable insights for building industrial-grade retrieval systems.
arXiv Detail & Related papers (2026-03-05T05:03:07Z)
Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs? [42.608899417822656]
We construct a dataset using 50 1B parameter LLM variants with systematically varied pre-training configurations.<n>We introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%.
arXiv Detail & Related papers (2025-04-16T21:19:09Z)
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models.<n>We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss.<n>We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
Scaling Laws for Precision [73.24325358259753]
We devise "precision-aware" scaling laws for both training and inference.<n>For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data.<n>For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
arXiv Detail & Related papers (2024-11-07T00:10:10Z)
Scalable Influence and Fact Tracing for Large Language Model Pretraining [14.598556308631018]
Training data attribution (TDA) methods aim to attribute model outputs back to specific training examples.<n>We refine existing gradient-based methods to work effectively at scale.<n>We release our prompt set and model outputs, along with a web-based visualization tool to explore influential examples.
arXiv Detail & Related papers (2024-10-22T20:39:21Z)
The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints. We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes. In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
Learning Sample Difficulty from Pre-trained Models for Reliable Prediction [55.77136037458667]
We propose to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization. We simultaneously improve accuracy and uncertainty calibration across challenging benchmarks.
arXiv Detail & Related papers (2023-04-20T07:29:23Z)
Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models [46.24479693469042]
This paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not.
arXiv Detail & Related papers (2022-10-25T17:45:36Z)
Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks. We introduce a new scoring method that casts a plausibility ranking task in a full-text format. We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples. We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries. We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.