Compute Optimal Scaling of Skills: Knowledge vs Reasoning
- URL: http://arxiv.org/abs/2503.10061v2
- Date: Fri, 14 Mar 2025 01:39:39 GMT
- Title: Compute Optimal Scaling of Skills: Knowledge vs Reasoning
- Authors: Nicholas Roberts, Niladri Chatterji, Sharan Narang, Mike Lewis, Dieuwke Hupkes,
- Abstract summary: We ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set.
- Score: 50.76705503978189
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as 'compute-optimally' trading off parameter count and dataset size, alongside a growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: scaling laws are skill-dependent. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, even when correcting for datamix differences, knowledge and code exhibit fundamental differences in scaling behaviour. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that a misspecified validation set can shift the compute-optimal parameter count by nearly 50%, depending on its skill composition.
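To make the compute-optimal trade-off concrete, here is a minimal sketch of fitting a Chinchilla-style parametric loss law L(N, D) = E + A/N^alpha + B/D^beta and locating the loss-minimising parameter count on an iso-FLOPs curve C ≈ 6ND. All numbers below are hypothetical and this is not the paper's fitting code; fitting one such law per skill (e.g. knowledge QA vs. code) is one way to see how the optimal N can differ by skill.

```python
# Minimal sketch (hypothetical data, not the paper's code): fit a
# Chinchilla-style loss law L(N, D) = E + A/N^alpha + B/D^beta, then find
# the compute-optimal parameter count N under a FLOPs budget C ~ 6*N*D.
import numpy as np
from scipy.optimize import curve_fit, minimize_scalar

def loss_law(X, E, A, alpha, B, beta):
    N, D = X
    return E + A / N**alpha + B / D**beta

# Hypothetical training runs: (parameter count, training tokens, final loss).
N_obs = np.array([1e8, 2e8, 4e8, 1e9, 2e9, 4e9])
D_obs = np.array([2e9, 4e9, 8e9, 2e10, 4e10, 8e10])
L_obs = np.array([3.48, 3.14, 2.87, 2.58, 2.42, 2.28])

params, _ = curve_fit(loss_law, (N_obs, D_obs), L_obs,
                      p0=[1.7, 400.0, 0.34, 410.0, 0.28], maxfev=20000)
E, A, alpha, B, beta = params

def loss_on_iso_flops(logN, C=1e21):
    # Along the iso-FLOPs curve C = 6*N*D, loss is a function of N alone.
    N = 10**logN
    D = C / (6 * N)
    return loss_law((N, D), E, A, alpha, B, beta)

opt = minimize_scalar(loss_on_iso_flops, bounds=(7.0, 11.0), method="bounded")
print(f"compute-optimal N at C=1e21 FLOPs: {10**opt.x:.3e}")
```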
Related papers
- LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws [21.053622641336744]
Loss-to-loss scaling laws relate losses across pretraining datasets and downstream tasks. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend.
arXiv Detail & Related papers (2025-02-17T18:45:25Z)
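A loss-to-loss relation of the kind described in the entry above can be illustrated with a shifted power law L_B = k(L_A - E_A)^gamma + E_B fit across a model family. The losses below are hypothetical, and this functional form is one common choice, not necessarily the paper's exact one.

```python
# Hedged sketch of a loss-to-loss fit (hypothetical losses): relate loss on
# pretraining set B to loss on set A via L_B = k*(L_A - E_A)**gamma + E_B.
import numpy as np
from scipy.optimize import curve_fit

def loss_to_loss(L_A, k, gamma, E_A, E_B):
    return k * (L_A - E_A)**gamma + E_B

# Hypothetical losses of the same model family on two pretraining datasets.
L_A = np.array([3.2, 2.9, 2.6, 2.4, 2.2])
L_B = np.array([3.58, 3.22, 2.86, 2.62, 2.38])

# Bounds keep E_A below min(L_A) so the power-law base stays positive.
(k, gamma, E_A, E_B), _ = curve_fit(
    loss_to_loss, L_A, L_B, p0=[1.0, 1.0, 1.5, 1.5],
    bounds=([0.0, 0.1, 0.0, 0.0], [10.0, 5.0, 2.1, 3.0]))
print(f"k={k:.2f} gamma={gamma:.2f} E_A={E_A:.2f} E_B={E_B:.2f}")
print("predicted L_B at L_A=2.0:", loss_to_loss(2.0, k, gamma, E_A, E_B))
```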
- Value-Based Deep RL Scales Predictably [100.21834069400023]
We show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI Gym, and IsaacGym.
arXiv Detail & Related papers (2025-02-06T18:59:47Z)
- Predicting Large Language Model Capabilities on Closed-Book QA Tasks Using Only Information Available Prior to Training [51.60874286674908]
We focus on predicting performance on Closed-book Question Answering (CBQA) tasks, which are closely tied to pre-training data and knowledge retention. We address three major challenges: 1) mastering the entire pre-training process, especially data construction; 2) evaluating a model's knowledge retention; and 3) predicting task-specific knowledge retention using only information available prior to training. We introduce the SMI metric, an information-theoretic measure that quantifies the relationship between pre-training data, model size, and task-specific knowledge retention.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
- The interplay between domain specialization and model size [8.653321928148547]
We investigate the interplay between domain and model size during continued pretraining under compute-constrained scenarios. Our goal is to identify an optimal training regime for this scenario and to detect patterns in this interplay that can be generalized across different model sizes and domains.
arXiv Detail & Related papers (2025-01-03T19:28:53Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
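As a toy illustration of the Bayesian-learner view of ICL described in the entry above (a sketch of the general idea, not the paper's derivation): a posterior over candidate tasks sharpens as in-context examples accumulate, so prediction error falls predictably with the number of shots.

```python
# Toy sketch of the Bayesian view of in-context learning: the learner holds a
# prior over candidate "tasks" and updates it with each in-context example,
# so its prediction improves predictably with the number of shots.
import numpy as np

# Hypothetical discrete task family: task t emits label 1 with prob p_t.
task_probs = np.array([0.1, 0.5, 0.9])   # candidate tasks
posterior = np.array([1/3, 1/3, 1/3])    # start from a uniform prior
true_task = 2                            # data really comes from p = 0.9

rng = np.random.default_rng(0)
for n_shots in range(1, 21):
    y = rng.random() < task_probs[true_task]   # one in-context example
    lik = task_probs if y else 1 - task_probs  # per-task likelihood of y
    posterior = posterior * lik
    posterior /= posterior.sum()               # Bayes update
    pred = posterior @ task_probs              # posterior-mean prediction
    print(f"shots={n_shots:2d}  P(true task)={posterior[true_task]:.3f}  pred={pred:.3f}")
```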
- gzip Predicts Data-dependent Scaling Laws [2.5461535398221478]
We generate training datasets of varying complexities by modulating the syntactic properties of a PCFG. We propose a new data-dependent scaling law for LMs that accounts for the training data's gzip-compressibility.
arXiv Detail & Related papers (2024-05-26T20:33:08Z)
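The gzip-compressibility signal in the entry above is straightforward to compute; a minimal sketch with hypothetical text samples:

```python
# Minimal sketch of gzip-compressibility as a data-complexity proxy:
# the ratio of compressed to raw byte length for a corpus sample.
import gzip

def gzip_compressibility(text: str) -> float:
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)  # lower = more compressible

# Hypothetical samples: repetitive (low complexity) vs. varied (higher).
print(gzip_compressibility("the cat sat on the mat. " * 200))
print(gzip_compressibility(" ".join(str(i**2) for i in range(2000))))
```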
- Exploring the Mystery of Influential Data for Mathematical Reasoning [127.61978092016228]
We propose a Quality-aware Diverse Selection (QaDS) strategy for mathematical reasoning.
A comparison with other selection strategies validates the superiority of QaDS.
With OpenMathMix, we achieve a state-of-the-art 48.8% accuracy on MATH with a 7B base model.
arXiv Detail & Related papers (2024-04-01T12:01:06Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
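In the spirit of the gradient-similarity selection described in the entry above (a hedged sketch, not the authors' LESS implementation; the projected gradient features here are random stand-ins), one can score candidate training examples by cosine similarity to a target task's mean gradient direction and keep a small top-scoring fraction:

```python
# Hedged sketch of gradient-similarity data selection (not the LESS code):
# score each candidate training example by cosine similarity between its
# low-rank-projected gradient feature and the mean gradient feature of a
# small target set, then keep the top 5%.
import numpy as np

rng = np.random.default_rng(0)
d_proj = 256                                       # projection dimension
train_grads = rng.normal(size=(10_000, d_proj))    # hypothetical projected grads
target_grads = rng.normal(size=(32, d_proj))       # target-task examples

target_dir = target_grads.mean(axis=0)
target_dir /= np.linalg.norm(target_dir)
scores = train_grads @ target_dir / np.linalg.norm(train_grads, axis=1)

k = int(0.05 * len(train_grads))                   # the 5% budget from the abstract
selected = np.argsort(scores)[-k:]                 # indices of top-scoring examples
print(f"selected {len(selected)} examples; top score = {scores[selected[-1]]:.3f}")
```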
- The choice of scaling technique matters for classification performance [6.745479230590518]
We compare the impact of 5 scaling techniques on the performance of 20 classification algorithms, spanning monolithic and ensemble models.
Results show that the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases.
We also show how the performance variation of an ensemble model, considering different scaling techniques, tends to be dictated by that of its base model.
arXiv Detail & Related papers (2022-12-23T13:51:45Z)
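A minimal sketch of this kind of comparison using scikit-learn (an illustration on a stock dataset, not the paper's full experimental protocol):

```python
# Minimal sketch: compare feature-scaling techniques ahead of one classifier
# via cross-validated accuracy on a stock scikit-learn dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
scalers = {"none": None, "standard": StandardScaler(),
           "minmax": MinMaxScaler(), "robust": RobustScaler()}
for name, scaler in scalers.items():
    steps = [scaler] if scaler is not None else []
    pipe = make_pipeline(*steps, LogisticRegression(max_iter=5000))
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name:9s} accuracy = {score:.3f}")
```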