Can We Predict Before Executing Machine Learning Agents?
- URL: http://arxiv.org/abs/2601.05930v1
- Date: Fri, 09 Jan 2026 16:44:17 GMT
- Title: Can We Predict Before Executing Machine Learning Agents?
- Authors: Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, Ningyu Zhang,
- Abstract summary: We formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report. We instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%.
- Score: 74.39460101251792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate-Execute-Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%. Our code and dataset will be publicly available soon at https://github.com/zjunlp/predict-before-execute.
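The Predict-then-Verify idea in the abstract can be sketched as a tournament in which an LLM preference predictor screens candidate solutions and expensive execution is reserved for low-confidence calls. The function names, the heuristic stand-in for the LLM call, and the confidence threshold below are all illustrative assumptions, not FOREAGENT's implementation.

```python
# Illustrative sketch of a Predict-then-Verify loop: a cheap predictor
# ranks candidate solutions pairwise, and physical execution is used
# only when the prediction's confidence is low. All names, the fake
# "LLM" heuristic, and the threshold are hypothetical.

def llm_predict_preference(report, sol_a, sol_b):
    """Stand-in for an LLM call primed with a data-analysis report.

    Returns (preferred_solution, confidence in [0, 1]). A toy
    deterministic heuristic keeps the sketch runnable.
    """
    score_a = len(sol_a) % 7  # placeholder "reasoning"
    score_b = len(sol_b) % 7
    if score_a >= score_b:
        return sol_a, 0.5 + 0.5 * (score_a - score_b) / 7
    return sol_b, 0.5 + 0.5 * (score_b - score_a) / 7

def predict_then_verify(candidates, report, execute, conf_threshold=0.7):
    """Tournament over candidates; execute only low-confidence matchups."""
    best = candidates[0]
    executions = 0
    for challenger in candidates[1:]:
        winner, conf = llm_predict_preference(report, best, challenger)
        if conf < conf_threshold:
            # Prediction is uncertain: fall back to real execution.
            executions += 2
            winner = best if execute(best) >= execute(challenger) else challenger
        best = winner
    return best, executions
```

With `n` candidates, the loop makes `n - 1` pairwise calls; executions are incurred only for uncertain comparisons, which is where the claimed convergence speed-up would come from under these assumptions.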
Related papers
- A Rubric-Supervised Critic from Sparse Real-World Outcomes [87.11204512676193]
Real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. We propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used as a reward model for RL-based training or for inference-time scaling.
arXiv Detail & Related papers (2026-03-04T07:23:54Z)
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities [22.14002750185524]
We estimate capability boundaries: high conditional quantiles of benchmark scores as a function of log pre-training FLOPs. We validate their temporal reliability by fitting on earlier model generations and evaluating on later releases. We introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of the evaluation budget.
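The capability-boundary fit this entry describes, a high conditional quantile of benchmark score against log pre-training FLOPs, can be sketched with a crude residual-quantile shift; the linear form, the `tau=0.9` level, and the helper name are illustrative assumptions, not the paper's method.

```python
# Sketch: estimate an upper "capability boundary" by fitting a
# least-squares line of score vs. log FLOPs, then shifting its
# intercept to the tau-quantile of the residuals. Purely illustrative.
import numpy as np

def fit_capability_boundary(log_flops, scores, tau=0.9):
    """Return (slope, boundary_intercept) so that roughly a tau
    fraction of observed scores lies on or below the line."""
    slope, intercept = np.polyfit(log_flops, scores, 1)
    resid = scores - (slope * log_flops + intercept)
    return slope, intercept + np.quantile(resid, tau)
```

A proper quantile regression (pinball loss) would let the boundary's slope differ from the mean trend; the shift above assumes homoscedastic residuals.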
arXiv Detail & Related papers (2026-02-17T03:13:51Z)
- Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training [11.179110411255708]
We propose a direct framework to model the scaling of benchmark performance from the training budget. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure. We release the complete set of pretraining losses and downstream evaluation results.
arXiv Detail & Related papers (2025-12-09T18:33:48Z)
- LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss.
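The loss augmentation this entry describes can be sketched abstractly; the shapes, the weighting `lam`, and the function name are illustrative assumptions, not the exact LaSeR objective.

```python
# Sketch of a LaSeR-style objective: a standard RLVR policy loss plus
# an MSE term aligning a per-sample last-token self-reward score with
# the verifier's reward. Weighting and shapes are illustrative.
import numpy as np

def laser_style_loss(policy_loss, last_token_scores, verifier_rewards, lam=0.1):
    """Total loss = RLVR policy loss + lam * MSE(self-reward, reward)."""
    mse = np.mean((last_token_scores - verifier_rewards) ** 2)
    return policy_loss + lam * mse
```

The point of the extra term, under these assumptions, is that the model learns to emit its own verification signal in the last token, removing the second verification prompt at inference time.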
arXiv Detail & Related papers (2025-10-16T17:55:11Z)
- Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions [35.48753431700434]
We study text-only performance forecasting: estimating a model's score from a redacted task description and intended configuration. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible, reaching a mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds.
arXiv Detail & Related papers (2025-09-25T01:02:27Z)
- Continuous Visual Autoregressive Generation via Score Maximization [69.67438563485887]
We introduce a Continuous VAR framework that enables direct visual autoregressive generation without vector quantization. Within this framework, all that is needed is to select a strictly proper score and set it as the training objective to optimize.
arXiv Detail & Related papers (2025-05-12T17:58:14Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
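The ATC recipe this entry describes can be sketched in a few lines; the helper names and the choice of confidence score (e.g. max softmax probability) are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of Average Thresholded Confidence (ATC): calibrate a confidence
# threshold on labeled source data so that the fraction of examples
# below it matches the source error rate, then predict target accuracy
# as the fraction of unlabeled target confidences above that threshold.
import numpy as np

def learn_atc_threshold(source_conf, source_correct):
    """Pick t so that P(conf < t) on source equals the source error rate."""
    err = 1.0 - np.mean(source_correct)
    return np.quantile(source_conf, err)

def predict_target_accuracy(threshold, target_conf):
    """Predicted accuracy = fraction of target confidences above t."""
    return float(np.mean(target_conf >= threshold))
```

No target labels are needed at prediction time; only the model's confidence scores on the unlabeled target set.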
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Test-time Collective Prediction [73.74982509510961]
Multiple parties in machine learning want to jointly make predictions on future test points.
Agents wish to benefit from the collective expertise of the full set of agents, but may not be willing to release their data or model parameters.
We explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model.
arXiv Detail & Related papers (2021-06-22T18:29:58Z)
- Bayes DistNet -- A Robust Neural Network for Algorithm Runtime Distribution Predictions [1.8275108630751844]
Randomized algorithms are used in many state-of-the-art solvers for constraint satisfaction problems (CSP) and Boolean satisfiability (SAT) problems.
Previous state-of-the-art methods directly try to predict a fixed parametric distribution that the input instance follows.
The proposed Bayes DistNet model achieves robust predictive performance in the low-observation setting and handles censored observations.
arXiv Detail & Related papers (2020-12-14T01:15:39Z)
- Data-Efficient Reinforcement Learning with Self-Predictive Representations [21.223069189953037]
We train an agent to predict its own latent state representations multiple steps into the future.
On its own, this future prediction objective outperforms prior methods for sample-efficient deep RL from pixels.
Our full self-supervised objective, which combines future prediction and data augmentation, achieves a median human-normalized score of 0.415 on Atari.
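The multi-step latent-prediction objective this entry describes can be sketched as follows; the cosine-similarity scoring and all component names are toy stand-ins, not the paper's architecture.

```python
# Sketch of a self-predictive objective: roll a latent transition model
# forward over a horizon of actions and score each predicted latent
# against a target encoding with cosine similarity. Toy stand-ins only.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def self_predictive_objective(z0, actions, transition, target_latents):
    """Negative mean cosine similarity between predicted and target
    latents over the prediction horizon (lower is better)."""
    z = z0
    sims = []
    for action, target in zip(actions, target_latents):
        z = transition(z, action)  # predict next latent state
        sims.append(cosine(z, target))
    return -float(np.mean(sims))
```

In the real method the targets would come from a separate (e.g. momentum) encoder of future observations; here they are supplied directly to keep the sketch self-contained.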
arXiv Detail & Related papers (2020-07-12T07:38:15Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
- Model adaptation and unsupervised learning with non-stationary batch data under smooth concept drift [8.068725688880772]
Most predictive models assume that training and test data are generated from a stationary process.
We consider the scenario of a gradual concept drift due to the underlying non-stationarity of the data source.
We propose a novel, iterative algorithm for unsupervised adaptation of predictive models.
arXiv Detail & Related papers (2020-02-10T21:29:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.