On Randomness in Agentic Evals
- URL: http://arxiv.org/abs/2602.07150v1
- Date: Fri, 06 Feb 2026 19:49:13 GMT
- Title: On Randomness in Agentic Evals
- Authors: Bjarni Haukur Bjarnason, André Silva, Martin Monperrus
- Abstract summary: Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected.
- Score: 6.177270420667714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2--3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to determine the number of runs needed to detect expected effect sizes, and (3) consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope. While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.
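To make recommendations (1) and (3) concrete, here is a minimal sketch, not the authors' released code, that estimates pass@1 from multiple runs per task and reports pass@k (optimistic bound) and pass^k (pessimistic bound) for k > 1. The `results` mapping from task id to per-run booleans is a hypothetical input format; pass@k uses the standard unbiased estimator, and pass^k is the analogous all-k-succeed estimator.

```python
# Minimal sketch of multi-run pass@1, pass@k, and pass^k (not the authors' code).
from math import comb
from statistics import mean

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k runs sampled from n succeeds), given c successes. Assumes k <= n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """P(all k runs sampled from n succeed): the pessimistic bound."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

def summarize(results: dict[str, list[bool]], k: int = 5) -> dict[str, float]:
    per_task = [(len(runs), sum(runs)) for runs in results.values()]
    return {
        "pass@1 (multi-run)": mean(c / n for n, c in per_task),
        f"pass@{k}": mean(pass_at_k(n, c, k) for n, c in per_task),
        f"pass^{k}": mean(pass_pow_k(n, c, k) for n, c in per_task),
    }

# Hypothetical example: two tasks, ten independent runs each.
results = {"task-1": [True] * 7 + [False] * 3,
           "task-2": [True] * 2 + [False] * 8}
print(summarize(results, k=5))
```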
Related papers
- AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows [0.0]
AgentAssay is the first token-efficient framework for regression testing non-deterministic AI agents. It achieves 78-100% cost reduction while maintaining rigorous statistical guarantees.
arXiv Detail & Related papers (2026-03-03T04:59:25Z)
- SpatialBench-UC: Uncertainty-Aware Evaluation of Spatial Prompt Following in Text-to-Image Generation [0.0]
SpatialBench-UC is a small, reproducible benchmark for pairwise spatial relations. We release a benchmark package, versioned prompts, pinned configs, per-sample checker outputs, and report tables. We evaluate three baselines: Stable Diffusion 1.5, SD 1.5 BoxDiff, and SD 1.4 GLIGEN.
arXiv Detail & Related papers (2026-01-19T23:37:10Z)
- Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection [0.8737375836744933]
Few-shot prompting has emerged as a practical alternative to fine-tuning for leveraging the capabilities of large language models. We examine retrieval-augmented prompting as a strategy to improve few-shot performance in code vulnerability detection.
arXiv Detail & Related papers (2025-11-28T12:19:31Z)
- Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation [103.66549325018741]
We introduce two key metrics, signal and noise, that show differences in current benchmarks. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise.
arXiv Detail & Related papers (2025-08-18T17:56:04Z)
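As a rough illustration of the signal-to-noise idea in the entry above (the paper's exact definitions may differ), the sketch below takes "signal" to be the spread of final scores across models and "noise" to be the average within-model fluctuation across repeated runs or nearby checkpoints; both choices are assumptions for the example.

```python
# Hedged sketch of a benchmark signal-to-noise ratio; exact definitions are the paper's.
from statistics import mean, pstdev

def signal_to_noise(final_scores: list[float],
                    repeat_scores: list[list[float]]) -> float:
    signal = max(final_scores) - min(final_scores)          # spread across models
    noise = mean(pstdev(scores) for scores in repeat_scores)  # avg within-model variability
    return signal / noise

# Hypothetical example: three models, each evaluated with five repeats.
finals = [0.42, 0.47, 0.55]
repeats = [[0.41, 0.43, 0.42, 0.40, 0.44],
           [0.46, 0.48, 0.47, 0.45, 0.49],
           [0.54, 0.56, 0.55, 0.53, 0.57]]
print(signal_to_noise(finals, repeats))  # higher means more reliable comparisons
```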
- A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: How Many Repeats Are Enough? [0.8575004906002217]
We present a statistical analysis of the common metrics, and develop guidelines for experiment design. We derive a lower bound on the number of repeats in order to guarantee achieving a given accuracy in the metrics. We propose an algorithm to adaptively adjust the number of repeats needed to ensure the accuracy of the evaluated metric.
arXiv Detail & Related papers (2025-03-20T17:38:50Z)
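The entry above asks how many repeats are enough. As a generic illustration (not the paper's specific bound), the textbook normal-approximation rule n >= (z * sigma / eps)^2 gives the repeats needed to estimate a mean score to within +/- eps at confidence 1 - alpha, assuming the per-repeat standard deviation sigma is known or estimated from a pilot run.

```python
# Generic sample-size rule for the number of repeats (illustrative, not the paper's bound).
from math import ceil
from statistics import NormalDist

def repeats_needed(sigma: float, eps: float, alpha: float = 0.05) -> int:
    z = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    return ceil((z * sigma / eps) ** 2)

# e.g. per-run std of 1.5 percentage points, target half-width of 0.5 points:
print(repeats_needed(sigma=1.5, eps=0.5))  # -> 35 repeats
```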
- SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation [75.56845750400116]
Disaggregated evaluation -- estimation of performance of a machine learning model on different subpopulations -- is a core task when assessing performance and group-fairness of AI systems.
We develop SureMap, which has high estimation accuracy for both multi-task and single-task disaggregated evaluations of blackbox models.
Our method combines maximum a posteriori (MAP) estimation using a well-chosen prior together with cross-validation-free tuning via Stein's unbiased risk estimate (SURE).
arXiv Detail & Related papers (2024-11-14T17:53:35Z)
- ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention [48.697458429460184]
Two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer.
This paper proposes a well-designed model named ERNIE-Sparse.
It consists of two distinctive parts: (i) Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) Self-Attention Regularization (SAR) to minimize the distance between transformers with different attention topologies.
arXiv Detail & Related papers (2022-03-23T08:47:01Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
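A minimal sketch of ATC as described in the entry above (assumed from that description, not the authors' implementation): pick a threshold on a confidence score (e.g., max softmax or negative entropy) so that the fraction of labeled source examples above it matches source accuracy, then predict target accuracy as the fraction of unlabeled target examples above that threshold. The synthetic confidences below are purely illustrative.

```python
# Sketch of Average Thresholded Confidence (ATC); synthetic data is illustrative only.
import numpy as np

def learn_threshold(source_conf: np.ndarray, source_correct: np.ndarray) -> float:
    """Pick t so that the fraction of source examples with conf >= t matches source accuracy."""
    acc = float(source_correct.mean())
    return float(np.quantile(source_conf, 1.0 - acc))

def predict_target_accuracy(target_conf: np.ndarray, t: float) -> float:
    """Predicted target accuracy = fraction of unlabeled target examples above the threshold."""
    return float((target_conf >= t).mean())

rng = np.random.default_rng(0)
source_conf = rng.uniform(0.4, 1.0, size=1000)
source_correct = (rng.uniform(size=1000) < source_conf).astype(float)
t = learn_threshold(source_conf, source_correct)
target_conf = rng.uniform(0.3, 1.0, size=1000)
print(predict_target_accuracy(target_conf, t))
```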
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
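To illustrate the resampling approach referenced in the entry above (a generic bootstrap, not necessarily the paper's exact protocol), the sketch below computes a percentile confidence interval for the correlation between an automatic metric and human judgments, resampling over the evaluated summaries; the score lists are hypothetical inputs.

```python
# Generic bootstrap percentile CI for metric-vs-human correlation (Python 3.10+).
import random
from statistics import correlation

def bootstrap_ci(metric_scores, human_scores, n_boot=10_000, alpha=0.05):
    n = len(metric_scores)
    stats = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        xs = [metric_scores[i] for i in idx]
        ys = [human_scores[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # skip degenerate resamples with zero variance
        stats.append(correlation(xs, ys))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Hypothetical metric and human scores for eight summaries:
metric = [0.31, 0.45, 0.52, 0.40, 0.61, 0.35, 0.58, 0.49]
human = [2.0, 3.4, 4.1, 3.0, 4.6, 2.5, 3.9, 3.6]
print(bootstrap_ci(metric, human, n_boot=2000))
```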
- Fast Uncertainty Quantification for Deep Object Pose Estimation [91.09217713805337]
Deep learning-based object pose estimators are often unreliable and overconfident.
In this work, we propose a simple, efficient, and plug-and-play UQ method for 6-DoF object pose estimation.
arXiv Detail & Related papers (2020-11-16T06:51:55Z)