Forecasting Frontier Language Model Agent Capabilities
- URL: http://arxiv.org/abs/2502.15850v2
- Date: Mon, 03 Mar 2025 17:11:16 GMT
- Title: Forecasting Frontier Language Model Agent Capabilities
- Authors: Govind Pimpale, Axel Højmark, Jérémy Scheurer, Marius Hobbhahn
- Abstract summary: We evaluate six forecasting methods that predict downstream capabilities of Language Models (LMs). We use "one-step" approaches that predict benchmark scores directly from input metrics like compute or model release date, or "two-step" approaches that first predict an intermediate metric, like the principal component of cross-benchmark performance (PC-1) or human-evaluated competitive Elo ratings. Our forecast predicts that by the beginning of 2026, non-specialized LM agents with low capability elicitation will reach a success rate of 54% on SWE-Bench Verified, while state-of-the-art LM agents will reach an 87% success rate.
- Score: 0.7499722271664147
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: As Language Models (LMs) increasingly operate as autonomous agents, accurately forecasting their capabilities becomes crucial for societal preparedness. We evaluate six forecasting methods that predict downstream capabilities of LM agents. We use "one-step" approaches that predict benchmark scores directly from input metrics like compute or model release date, or "two-step" approaches that first predict an intermediate metric, like the principal component of cross-benchmark performance (PC-1) or human-evaluated competitive Elo ratings. We evaluate our forecasting methods by backtesting them on a dataset of 38 LMs from the OpenLLM 2 leaderboard. We then use the validated two-step approach (Release Date$\to$Elo$\to$Benchmark) to predict LM agent performance for frontier models on three benchmarks: SWE-Bench Verified (software development), Cybench (cybersecurity assessment), and RE-Bench (ML research engineering). Our forecast predicts that by the beginning of 2026, non-specialized LM agents with low capability elicitation will reach a success rate of 54% on SWE-Bench Verified, while state-of-the-art LM agents will reach an 87% success rate. Our approach does not account for recent advances in inference-compute scaling and might thus be too conservative.
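To make the two-step pipeline concrete, here is a minimal sketch, assuming hypothetical placeholder data: a linear fit maps release date to an Elo rating, a logistic curve maps Elo to a benchmark success rate, and the two fits are chained to extrapolate to a future release date. The data arrays, the `future_day` value, and the chosen functional forms are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a two-step forecast: Release Date -> Elo -> Benchmark score.
# All numbers are hypothetical placeholders, not data from the paper.
import numpy as np
from scipy.optimize import curve_fit

# Step 1: linear trend from release date (days since an arbitrary reference) to Elo.
release_days = np.array([0, 120, 240, 360, 480, 600], dtype=float)   # hypothetical
elo = np.array([1050, 1090, 1140, 1180, 1230, 1270], dtype=float)    # hypothetical
date_to_elo = np.polyfit(release_days, elo, deg=1)                   # slope, intercept

# Step 2: logistic curve from Elo to benchmark success rate (e.g., SWE-Bench Verified).
def logistic(x, midpoint, scale):
    return 1.0 / (1.0 + np.exp(-(x - midpoint) / scale))

bench_elo = np.array([1050, 1140, 1230, 1270], dtype=float)          # hypothetical
bench_score = np.array([0.05, 0.20, 0.40, 0.50])                     # hypothetical
(midpoint, scale), _ = curve_fit(logistic, bench_elo, bench_score, p0=[1200, 100])

# Chain the two fits: extrapolate Elo to a future date, then map Elo to a score.
future_day = 1095.0                                                  # ~3 years out
predicted_elo = np.polyval(date_to_elo, future_day)
predicted_score = logistic(predicted_elo, midpoint, scale)
print(f"Predicted Elo: {predicted_elo:.0f}, predicted success rate: {predicted_score:.2f}")
```

As described in the abstract, such a chained fit would be backtested on held-out models (here, the 38 LMs from the OpenLLM 2 leaderboard) before being trusted for extrapolation.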
Related papers
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
- Consistency Checks for Language Model Forecasters [54.62507816753479]
We measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
arXiv Detail & Related papers (2024-12-24T16:51:35Z)
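As a toy illustration of the kind of consistency check described above (not that paper's evaluation system), the snippet below scores a forecaster on negation consistency: the probabilities assigned to an event and to its negation should sum to one. The `forecast` stub and the demo questions are placeholders.

```python
# Toy negation-consistency check for a probabilistic forecaster.
# The forecaster below is a placeholder stub, not a real LM-backed system.

def forecast(question: str) -> float:
    """Hypothetical forecaster: returns a probability in [0, 1]."""
    # A real system would elicit this from a language model; the stub returns
    # canned probabilities for the demo questions.
    canned = {
        "Will X happen by 2030?": 0.70,
        "Will X not happen by 2030?": 0.35,
    }
    return canned.get(question, 0.5)

def negation_violation(question: str, negated_question: str) -> float:
    """Violation is |p(A) + p(not A) - 1|; 0 means perfectly consistent."""
    return abs(forecast(question) + forecast(negated_question) - 1.0)

print(negation_violation("Will X happen by 2030?", "Will X not happen by 2030?"))
```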
- AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities [0.3745329282477066]
We tasked 16 state-of-the-art large language models with estimating the likelihood of Artificial General Intelligence (AGI) emerging by 2030. To assess the quality of these forecasts, we implemented an automated peer review process (LLM-PR).
arXiv Detail & Related papers (2024-12-12T15:52:41Z)
- Can We Predict Performance of Large Models across Vision-Language Tasks? [34.27319941609499]
We propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks.
We use a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset.
We demonstrate the accuracy of probabilistic matrix factorization (PMF) in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data.
arXiv Detail & Related papers (2024-10-14T03:00:12Z)
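As a rough sketch of completing such a sparse performance matrix, the snippet below runs a plain low-rank matrix factorization by gradient descent on the observed entries. It is a simplified stand-in for probabilistic matrix factorization, not that paper's implementation; the matrix sizes, rank, and scores are synthetic.

```python
# Low-rank completion of a sparse performance matrix R, where R[m, n] is the
# score of model m on dataset n. Synthetic data; simplified stand-in for PMF.
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 8, 6, 3                            # models, datasets, latent rank (assumed)
true_U = rng.uniform(0.2, 1.0, size=(M, K))
true_V = rng.uniform(0.2, 1.0, size=(N, K))
R = (true_U @ true_V.T) / K                  # hypothetical low-rank score matrix
mask = rng.uniform(size=(M, N)) < 0.6        # only ~60% of entries observed

U = 0.1 * rng.standard_normal((M, K))        # latent model factors
V = 0.1 * rng.standard_normal((N, K))        # latent dataset factors
lr, reg = 0.05, 0.01

for _ in range(2000):                        # gradient descent on observed entries only
    err = mask * (U @ V.T - R)
    grad_U = err @ V + reg * U
    grad_V = err.T @ U + reg * V
    U -= lr * grad_U
    V -= lr * grad_V

pred = U @ V.T                               # predictions for all entries, observed or not
print("mean abs error on held-out entries:", np.abs((pred - R)[~mask]).mean())
```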
- Approaching Human-Level Forecasting with Language Models [34.202996056121]
We study whether language models (LMs) can forecast at the level of competitive human forecasters.
We develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions.
arXiv Detail & Related papers (2024-02-28T18:54:18Z)
- Detecting Toxic Flow [0.40964539027092917]
This paper develops a framework to predict toxic trades that a broker receives from her clients.
We use a proprietary dataset of foreign exchange transactions to test our methodology.
We devise a strategy for the broker who uses toxicity predictions to internalise or to externalise each trade received from her clients.
arXiv Detail & Related papers (2023-12-10T09:00:09Z)
- Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems, yet they tend to be overconfident in their wrong predictions. Previous work shows that introducing an extra calibration task can mitigate this issue. We propose a training algorithm, LM-TOAST, to tackle the challenge of making PLMs both task-solvers and self-calibrators.
arXiv Detail & Related papers (2023-07-21T02:51:41Z)
- Toward Reliable Human Pose Forecasting with Uncertainty [51.628234388046195]
We develop an open-source library for human pose forecasting, including multiple models, supporting several datasets.
We devise two types of uncertainty in the problem to increase performance and convey better trust.
arXiv Detail & Related papers (2023-04-13T17:56:08Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
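A minimal sketch of the ATC idea, assuming max-softmax confidences as the score: choose the threshold so that the fraction of labeled source examples above it matches the source accuracy, then report the fraction of unlabeled target examples above the same threshold as the predicted target accuracy. The confidence arrays below are synthetic placeholders, not data from the paper.

```python
# Sketch of Average Thresholded Confidence (ATC) with synthetic confidences.
import numpy as np

rng = np.random.default_rng(0)

# Labeled source set: per-example confidences and 0/1 correctness (hypothetical).
source_conf = rng.uniform(0.3, 1.0, size=5000)
source_correct = (rng.uniform(size=5000) < source_conf).astype(float)

# Pick threshold t so that the fraction of source examples with confidence > t
# equals the observed source accuracy.
source_acc = source_correct.mean()
threshold = np.quantile(source_conf, 1.0 - source_acc)

# Predict accuracy on an unlabeled, distribution-shifted target set (hypothetical).
target_conf = rng.uniform(0.2, 0.95, size=5000)
predicted_target_acc = (target_conf > threshold).mean()
print(f"threshold={threshold:.3f}, predicted target accuracy={predicted_target_acc:.3f}")
```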
- Test-time Collective Prediction [73.74982509510961]
Multiple parties in machine learning want to jointly make predictions on future test points.
Agents wish to benefit from the collective expertise of the full set of agents, but may not be willing to release their data or model parameters.
We explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model.
arXiv Detail & Related papers (2021-06-22T18:29:58Z)