Predicting Task Performance with Context-aware Scaling Laws
- URL: http://arxiv.org/abs/2510.14919v1
- Date: Thu, 16 Oct 2025 17:35:18 GMT
- Title: Predicting Task Performance with Context-aware Scaling Laws
- Authors: Kyle Montgomery, David Park, Jianhong Tu, Michael Bendersky, Beliz Gunel, Dawn Song, Chenguang Wang
- Abstract summary: We propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases.
- Score: 56.6850444554434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.
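The abstract stops short of the functional form, but the core idea, jointly fitting downstream performance against training compute and context length, is easy to illustrate. The sketch below is a hypothetical stand-in: the logistic link, the toy data, and `context_aware_law` are assumptions made for illustration, not the paper's actual parameterization (which lives in the linked repository).

```python
# A minimal sketch of fitting a context-aware scaling law.
# Assumption: performance saturates logistically in log-compute and
# log-context; the paper's exact functional form may differ.
import numpy as np
from scipy.optimize import curve_fit

def context_aware_law(X, a, b0, b1, b2):
    """Hypothetical form: accuracy rises and saturates in both inputs."""
    log_compute, log_context = X
    return a / (1.0 + np.exp(-(b0 + b1 * log_compute + b2 * log_context)))

# Toy observations: (training FLOPs, context tokens) -> task accuracy.
compute = np.array([1e19, 1e20, 1e21, 1e19, 1e20, 1e21])
context = np.array([512, 512, 512, 4096, 4096, 4096])
accuracy = np.array([0.22, 0.41, 0.55, 0.30, 0.52, 0.68])

X = (np.log(compute), np.log(context))
params, _ = curve_fit(context_aware_law, X, accuracy,
                      p0=[1.0, 0.0, 0.1, 0.1], maxfev=10_000)

# Extrapolate along either axis, e.g. 10x the compute at a 16k context.
pred = context_aware_law((np.log(1e22), np.log(16384)), *params)
print(f"predicted accuracy: {pred:.3f}")
```

Once fitted, the same closed form extrapolates along either axis, which mirrors how the framework is used to predict performance as the amount of context grows.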
Related papers
- Implicit Federated In-context Learning For Task-Specific LLM Fine-Tuning [10.042856500868805]
We propose the Implicit Federated In-Context Learning (IFed-ICL) framework. IFed-ICL draws inspiration from federated learning to establish a novel distributed collaborative paradigm. Compared to traditional methods, IFed-ICL avoids the extensive parameter updates required by conventional fine-tuning methods.
arXiv Detail & Related papers (2025-11-10T06:34:29Z)
- A Framework for Quantifying How Pre-Training and Context Benefit In-Context Learning [52.07397258423034]
We propose a new framework to analyze ICL performance in a class of realistic settings. We derive the precise relationship between ICL performance, context length, and the KL divergence between the pre-training and query task distributions.
arXiv Detail & Related papers (2025-10-26T09:21:29Z)
- A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search [15.387256204743407]
Large language models (LLMs) demand considerable computational, energy, and financial resources during both training and deployment. Inference costs now represent a significant and growing component of the overall resource burden. We introduce directed stochastic skill search (DS3), a general framework that represents inference as expressive, stochastic traversal over a learned skill graph.
arXiv Detail & Related papers (2025-06-10T14:47:48Z)
- Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is generalization. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
- Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
- Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning. We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads. We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z)
- AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution [35.18192555185193]
We introduce AttriBoT, a series of novel techniques for efficiently computing an approximation of the leave-one-out (LOO) error for context attribution. AttriBoT can provide a >300x speedup while remaining more faithful to a target model's LOO error than prior context attribution methods (a naive exact-LOO sketch follows this list).
arXiv Detail & Related papers (2024-11-22T18:06:14Z)
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms. We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
- A Generic Performance Model for Deep Learning in a Distributed Environment [0.7829352305480285]
We propose a generic performance model of an application in a distributed environment, with a general expression for the application's execution time.
We have evaluated the proposed model on three deep learning frameworks, including MXnet and Pytorch.
arXiv Detail & Related papers (2023-05-19T13:30:34Z)
- On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning.
We evaluate four model families (OPT, BLOOM, CodeGen, and Codex) on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
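For readers unfamiliar with leave-one-out context attribution, the quantity AttriBoT approximates (see the AttriBoT entry above), here is a minimal exact-LOO baseline. The `score_fn` callable is a hypothetical placeholder for a model's log-likelihood of the target answer given the context; AttriBoT's actual acceleration techniques are not reproduced here.

```python
# Naive exact leave-one-out (LOO) context attribution.
# Assumption: score_fn(chunks, query) returns the model's log-likelihood
# of the target answer given the listed context chunks (hypothetical API).
from typing import Callable, Sequence

def loo_attribution(
    chunks: Sequence[str],
    query: str,
    score_fn: Callable[[Sequence[str], str], float],
) -> list[float]:
    """Attribution of chunk i = full-context score minus score without chunk i."""
    full_score = score_fn(chunks, query)
    attributions = []
    for i in range(len(chunks)):
        # Re-score with chunk i removed; a large drop means chunk i mattered.
        ablated = [c for j, c in enumerate(chunks) if j != i]
        attributions.append(full_score - score_fn(ablated, query))
    return attributions
```

This runs the model once per chunk, which is exactly the O(n) cost in forward passes that approximation methods like AttriBoT aim to avoid.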