How predictable is language model benchmark performance?
- URL: http://arxiv.org/abs/2401.04757v1
- Date: Tue, 9 Jan 2024 17:34:30 GMT
- Title: How predictable is language model benchmark performance?
- Authors: David Owen
- Abstract summary: We show that average benchmark performance, aggregating over many individual tasks, is decently predictable as a function of training compute scale.
Individual task performance remains significantly more predictable than chance.
- Score: 0.07143413923310668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate large language model performance across five orders of
magnitude of compute scaling in eleven recent model architectures. We show that
average benchmark performance, aggregating over many individual tasks and
evaluations as in the commonly-used BIG-Bench dataset, is decently predictable
as a function of training compute scale. Specifically, when extrapolating
BIG-Bench Hard performance across one order of magnitude in compute, we observe
average absolute errors of 6 percentage points (pp). By contrast, extrapolation
for individual BIG-Bench tasks across an order of magnitude in compute yields
higher average errors of 18pp. Nonetheless, individual task performance remains
significantly more predictable than chance. Overall, our work suggests compute
scaling provides a promising basis to forecast AI capabilities in diverse
benchmarks, though predicting performance in specific tasks poses challenges.
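A minimal sketch of the extrapolation setup the abstract describes, with invented numbers standing in for the real BIG-Bench Hard data: fit a sigmoid in log10(training compute) to average benchmark accuracy, predict one order of magnitude beyond the fitted range, and measure the absolute error in percentage points.

```python
# Toy numbers, not the paper's data: fit accuracy as a logistic function of
# log10 training compute, then extrapolate one order of magnitude.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_c, lo, hi, mid, slope):
    """Accuracy as a logistic curve in log10(compute)."""
    return lo + (hi - lo) / (1.0 + np.exp(-slope * (log_c - mid)))

log_compute = np.array([21.0, 21.3, 21.6, 21.9, 22.2, 23.2])  # log10 FLOP
accuracy    = np.array([0.28, 0.32, 0.37, 0.43, 0.49, 0.68])

# Fit on the first five points; the last sits one order of magnitude beyond.
params, _ = curve_fit(sigmoid, log_compute[:-1], accuracy[:-1],
                      p0=[0.25, 0.90, 22.5, 2.0], maxfev=20000)
pred = sigmoid(log_compute[-1], *params)
print(f"predicted {pred:.2f}, observed {accuracy[-1]:.2f}, "
      f"absolute error {abs(pred - accuracy[-1]) * 100:.1f} pp")
```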
Related papers
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z)
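The difficulty dependence noted above can be illustrated with a toy calculation (my own, not the paper's method): under a simple majority-vote scheme, spending more test-time compute on extra samples pays off only when the per-sample success rate is already above one half.

```python
# Toy calculation: accuracy of majority voting over n samples as a function
# of the per-sample success rate p (a rough proxy for prompt difficulty).
import math

def majority_vote_accuracy(p: float, n: int) -> float:
    """P(majority of n independent samples is correct), n odd."""
    k_needed = n // 2 + 1
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_needed, n + 1))

for p in (0.8, 0.55, 0.3):        # easy, borderline, hard prompt
    gains = [majority_vote_accuracy(p, n) for n in (1, 9, 33)]
    print(f"p={p:.2f}: n=1 -> {gains[0]:.3f}, n=9 -> {gains[1]:.3f}, "
          f"n=33 -> {gains[2]:.3f}")
```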
- Assessing the Generalizability of a Performance Predictive Model [0.6070952062639761]
We propose a workflow to estimate the generalizability of a predictive model for algorithm performance.
The results show that generalizability patterns in the landscape feature space are reflected in the performance space.
arXiv Detail & Related papers (2023-05-31T12:50:44Z)
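A hedged sketch of one common way to estimate generalizability of a performance predictor (synthetic features and suite labels; the paper's actual workflow may differ): hold out one problem group at a time, train on the rest, and measure prediction error on the held-out group.

```python
# Synthetic stand-in: X holds per-problem landscape features, y the measured
# algorithm performance, and `groups` assigns problems to benchmark suites.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=120)
groups = np.repeat(np.arange(6), 20)          # 6 problem suites, 20 each

# Train on five suites, test on the held-out one, and repeat.
for train, test in LeaveOneGroupOut().split(X, y, groups):
    model = RandomForestRegressor(random_state=0).fit(X[train], y[train])
    mae = mean_absolute_error(y[test], model.predict(X[test]))
    print(f"held-out suite {groups[test][0]}: MAE {mae:.3f}")
```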
- How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench [52.11481619456093]
We study the performance prediction problem on experiment records from BIG-bench.
An $R^2$ score greater than 95% indicates the presence of learnable patterns within the experiment records.
We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3\times$ smaller.
arXiv Detail & Related papers (2023-05-24T09:35:34Z)
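A sketch of the performance-prediction setup on experiment records, using synthetic records in place of the real BIG-bench logs: regress per-task scores on model and task descriptors and report $R^2$ on held-out records.

```python
# Synthetic experiment records: each row is (model scale, task id) with a
# task score generated from a hidden pattern the regressor should recover.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
log_params = rng.uniform(7, 11, n)            # log10 parameter count
task_id = rng.integers(0, 50, n)
task_bias = rng.normal(size=50)[task_id]      # per-task difficulty offset
score = 1.0 / (1.0 + np.exp(-(log_params - 9.0 + task_bias)))

X = np.column_stack([log_params, task_id])
X_tr, X_te, y_tr, y_te = train_test_split(X, score, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
print(f"R^2 on held-out records: {r2_score(y_te, model.predict(X_te)):.3f}")
```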
- A Meta-Learning Approach to Predicting Performance and Data Requirements [163.4412093478316]
We propose an approach to estimate the number of samples required for a model to reach a target performance.
We find that the power law, the de facto principle to estimate model performance, leads to large errors when using a small dataset.
We introduce a novel piecewise power law (PPL) that handles the two data regimes differently.
arXiv Detail & Related papers (2023-03-02T21:48:22Z)
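An illustrative piecewise power law in my own parameterization (continuous at the breakpoint n0; the paper's PPL may differ in detail): error scales as n^b1 in the small-data regime and as n^b2 in the large-data regime.

```python
# A continuous piecewise power law: error follows a*n^b1 below the
# breakpoint n0 and switches exponent to b2 above it, matched at n = n0.
import numpy as np
from scipy.optimize import curve_fit

def ppl(n, a, b1, b2, n0):
    return np.where(n <= n0, a * n**b1, a * n0 ** (b1 - b2) * n**b2)

rng = np.random.default_rng(2)
n = np.logspace(2, 6, 30)
y = ppl(n, 2.0, -0.2, -0.5, 1e4) * np.exp(rng.normal(scale=0.02, size=n.size))

(a, b1, b2, n0), _ = curve_fit(
    ppl, n, y, p0=[1.0, -0.1, -0.4, 5e3],
    bounds=([1e-3, -2.0, -2.0, 1e2], [10.0, 0.0, 0.0, 1e6]))
print(f"small-data exponent {b1:.2f}, large-data exponent {b2:.2f}, "
      f"breakpoint n0 ~ {n0:.0f}")
```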
- RF+clust for Leave-One-Problem-Out Performance Prediction [0.9281671380673306]
We study leave-one-problem-out (LOPO) performance prediction.
We analyze whether standard random forest (RF) model predictions can be improved by calibrating them with a weighted average of performance values.
arXiv Detail & Related papers (2023-01-23T16:14:59Z)
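A hedged reading of the RF+clust idea (k, the distance weights, and the blending factor below are illustrative choices, not the paper's): combine the random-forest prediction for an unseen problem with a similarity-weighted average of the known performance of nearby problems.

```python
# X_known / y_known are landscape features and measured performance for
# known problems; x_new is the unseen (left-out) problem.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X_known = rng.normal(size=(100, 8))
y_known = 2.0 * X_known[:, 0] + rng.normal(scale=0.1, size=100)
x_new = rng.normal(size=8)

rf = RandomForestRegressor(random_state=0).fit(X_known, y_known)
rf_pred = rf.predict(x_new[None])[0]

# Distance-weighted average over the k most similar known problems.
d = np.linalg.norm(X_known - x_new, axis=1)
nearest = np.argsort(d)[:5]
neighbour_pred = np.average(y_known[nearest], weights=1.0 / (d[nearest] + 1e-9))

alpha = 0.5                                   # blending weight, a free choice
print(f"RF {rf_pred:.3f}, neighbours {neighbour_pred:.3f}, "
      f"calibrated {alpha * rf_pred + (1 - alpha) * neighbour_pred:.3f}")
```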
- Scalable Estimation for Structured Additive Distributional Regression [0.0]
We propose a novel backfitting algorithm, based on the ideas of gradient descent, that can deal with virtually any amount of data on a conventional laptop.
Performance is evaluated using an extensive simulation study and an exceptionally challenging and unique example of lightning count prediction over Austria.
arXiv Detail & Related papers (2023-01-13T14:59:42Z)
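A minimal sketch of gradient-descent backfitting for an additive model, my own simplification of the general idea rather than the paper's algorithm: cycle over the additive components and update each with a minibatch gradient step, so memory use stays flat regardless of dataset size.

```python
# Streamed minibatches keep memory constant while each additive component
# is updated in turn against the current residuals.
import numpy as np

rng = np.random.default_rng(4)
N = 100_000
x1, x2 = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N)
y = np.sin(3 * x1) + x2**2 + rng.normal(scale=0.1, size=N)

def basis(x):
    """Small fixed basis; each additive component is linear in it."""
    return np.column_stack([x, x**2, x**3, np.sin(3 * x)])

coef = [np.zeros(4), np.zeros(4)]
lr, batch = 0.1, 1024
for _ in range(1500):
    idx = rng.integers(0, N, batch)             # draw a minibatch
    B = [basis(x1[idx]), basis(x2[idx])]
    for j in range(2):                          # backfit one component at a time
        resid = y[idx] - B[0] @ coef[0] - B[1] @ coef[1]
        coef[j] += lr * B[j].T @ resid / batch  # gradient step on squared loss
print("component 1 coefficients:", coef[0].round(2))
print("component 2 coefficients:", coef[1].round(2))
```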
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them [108.54545521369688]
We focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH).
We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex to surpass the average human-rater performance on 17 of the 23 tasks.
arXiv Detail & Related papers (2022-10-17T17:08:26Z)
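A sketch of the chain-of-thought prompt format used for BBH-style evaluation; the exemplar and the `cot_prompt` helper below are invented for illustration, while the paper uses three task-specific worked exemplars per task.

```python
# The exemplar is invented; real BBH CoT prompts use three task-specific
# worked examples per task.
EXEMPLAR = (
    "Q: Today is Jan 9, 2024. What is the date one week from today?\n"
    "A: Let's think step by step. One week is 7 days, and Jan 9 plus 7 days "
    "is Jan 16, 2024. The answer is 01/16/2024.\n\n"
)

def cot_prompt(question: str) -> str:
    """Prepend a worked exemplar and elicit step-by-step reasoning."""
    return f"{EXEMPLAR}Q: {question}\nA: Let's think step by step."

print(cot_prompt("Today is Jan 9, 2024. What is the date 10 days from today?"))
```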
- BB-ML: Basic Block Performance Prediction using Machine Learning Techniques [0.6020800302423842]
We propose to use Machine Learning (ML) techniques for performance prediction at a much finer granularity, namely at the Basic Block (BB) level.
We extrapolate the basic block execution counts of GPU applications and use them for predicting the performance for large input sizes from the counts of smaller input sizes.
We achieve an accuracy of 93.5% when extrapolating the basic block counts for large input sets after training on smaller input sets.
arXiv Detail & Related papers (2022-02-16T00:19:15Z)
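A toy version of the extrapolation idea (not BB-ML's actual pipeline): fit each basic block's execution count as a power law in input size using measurements at small inputs, then predict the count at a large input size.

```python
# Hypothetical per-block counts measured at small input sizes; the names
# and numbers are illustrative.
import numpy as np

small_sizes = np.array([64, 128, 256, 512])
counts = {
    "bb_loop_body":   np.array([4096, 16384, 65536, 262144]),   # ~n^2
    "bb_loop_header": np.array([64, 128, 256, 512]),            # ~n
}

large_size = 4096
for block, c in counts.items():
    # Fit log(count) = b*log(n) + a  ->  count = e^a * n^b.
    b, a = np.polyfit(np.log(small_sizes), np.log(c), deg=1)
    pred = np.exp(a) * large_size**b
    print(f"{block}: predicted count at n={large_size} is {pred:,.0f}")
```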
- Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression [51.770998056563094]
Probabilistic Gradient Boosting Machines (PGBM) is a method to create probabilistic predictions with a single ensemble of decision trees.
We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-03T08:32:13Z)
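PGBM itself learns per-leaf mean and variance estimates; as a stand-in that runs on stock scikit-learn, quantile gradient boosting gives a similar flavour of probabilistic output from a tree ensemble. This is an analogue, not PGBM's mechanism.

```python
# Quantile gradient boosting as an analogue for probabilistic predictions
# from a tree ensemble (PGBM's actual mechanism differs).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

models = {q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
          for q in (0.05, 0.5, 0.95)}
x_new = np.array([[4.0]])
lo, med, hi = (models[q].predict(x_new)[0] for q in (0.05, 0.5, 0.95))
print(f"median {med:.2f}, 90% interval [{lo:.2f}, {hi:.2f}]")
```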
- Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks.
First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU.
Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
arXiv Detail & Related papers (2021-02-10T15:23:20Z)
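A sketch of the calibration angle with toy numbers: a well-calibrated performance predictor's 95% confidence intervals should contain the true score about 95% of the time.

```python
# Toy numbers: predicted scores with fixed-width confidence intervals,
# checked against the true scores for empirical coverage.
import numpy as np

rng = np.random.default_rng(6)
true_f1 = rng.normal(70, 5, size=200)           # true F1 per setting
pred_f1 = true_f1 + rng.normal(scale=2, size=200)
half_width = 3.92                               # ~1.96 * assumed error sd of 2

covered = np.abs(true_f1 - pred_f1) <= half_width
print(f"nominal 95% intervals, empirical coverage {covered.mean():.0%}")
```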
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
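A toy sketch of the factorization idea (not the paper's Bayesian inference; `adapter_params` and the random embeddings are stand-ins): represent the parameters for each task-language pair as the sum of a task vector and a language vector, so unseen combinations can be composed zero-shot.

```python
# Compose adapter parameters for any (task, language) pair from separately
# learned task and language vectors; embeddings here are random stand-ins.
import numpy as np

rng = np.random.default_rng(7)
d, n_tasks, n_langs = 16, 4, 5
task_emb = rng.normal(size=(n_tasks, d))
lang_emb = rng.normal(size=(n_langs, d))

def adapter_params(task: int, lang: int) -> np.ndarray:
    """Parameters for a (task, language) pair, defined even for unseen pairs."""
    return task_emb[task] + lang_emb[lang]

# (task=3, lang=4) may never co-occur in training, yet its parameter
# vector is available by composition.
print(adapter_params(3, 4)[:4])
```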
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.