Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms
- URL: http://arxiv.org/abs/2601.07651v1
- Date: Mon, 12 Jan 2026 15:32:11 GMT
- Title: Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms
- Authors: Marc Lanctot, Kate Larson, Ian Gemp, Michael Kaisers
- Abstract summary: We propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks. On every iteration, the ranking algorithm chooses the task and agents to sample scores from. We find that the classical Elo rating system is a consistently reliable choice for efficient reduction of ranking error in practice.
- Score: 18.53965204068826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As intelligent agents become more generally capable, i.e., able to master a wide variety of tasks, the complexity and cost of properly evaluating them rises significantly. Tasks that assess specific capabilities of the agents can be correlated and stochastic, requiring many samples for accurate comparisons and leading to added costs. In this paper, we propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks, which assesses the performance of ranking algorithms as a function of the number of evaluation data samples. Rather than curating, filtering, or compressing existing data sets as a preprocessing step, we propose an online framing: on every iteration, the ranking algorithm chooses the task and agents to sample scores from. Evaluation algorithms then report a ranking of agents on each iteration, and their performance is assessed against the ground-truth ranking over time. Several baselines are compared under different experimental contexts, with synthetically generated data and simulated online access to real evaluation data from Atari game-playing agents. We find that the classical Elo rating system -- while it suffers from well-known failure modes, in theory -- is a consistently reliable choice for efficient reduction of ranking error in practice. A recently proposed method, Soft Condorcet Optimization, shows comparable performance to Elo on synthetic data and significantly outperforms Elo on real Atari agent evaluation. When task variation from the ground truth is high, selecting tasks based on proportional representation leads to a higher rate of ranking error reduction.
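To make the online framing concrete, the sketch below implements one minimal instance of the loop: a scheduler picks a task and a pair of agents, draws noisy scores, and updates Elo ratings, so a ranking can be read off at every iteration. The Gaussian score model, the uniform scheduler, and all constants are illustrative assumptions, not the paper's experimental setup.

```python
import random

random.seed(0)

# Hypothetical ground truth: per-agent, per-task mean scores.
NUM_AGENTS, NUM_TASKS = 5, 3
true_mean = [[random.random() for _ in range(NUM_TASKS)] for _ in range(NUM_AGENTS)]

def sample_score(agent, task):
    """One noisy evaluation sample -- stands in for running the agent once."""
    return true_mean[agent][task] + random.gauss(0.0, 0.1)

def elo_update(ratings, a, b, result, k=32.0):
    """Standard Elo update; result is 1.0 if agent a beat agent b, else 0.0."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((ratings[b] - ratings[a]) / 400.0))
    ratings[a] += k * (result - expected_a)
    ratings[b] -= k * (result - expected_a)

ratings = [1500.0] * NUM_AGENTS
for _ in range(2000):
    # Active choice: uniform here; the paper compares smarter schedulers.
    task = random.randrange(NUM_TASKS)
    a, b = random.sample(range(NUM_AGENTS), 2)
    result = 1.0 if sample_score(a, task) > sample_score(b, task) else 0.0
    elo_update(ratings, a, b, result)

# The ranking reported on this iteration, best agent first.
print("estimated:", sorted(range(NUM_AGENTS), key=lambda i: -ratings[i]))
print("truth:    ", sorted(range(NUM_AGENTS), key=lambda i: -sum(true_mean[i])))
```

Ranking error at each iteration would then be a rank distance (e.g., Kendall tau) between the estimated and ground-truth orders.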
Related papers
- CORE: Full-Path Evaluation of LLM Agents Beyond Final State [2.0391237204597368]
Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state. We propose a framework based on deterministic finite automata that encodes tasks as sets of valid tool-use paths. We introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness - Kendall's tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency.
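As a toy illustration of the DFA idea, a trace of tool calls is judged by whether it follows a valid path rather than by its final state alone. The states, tool names, and transitions below are invented; CORE's actual task encodings and metrics are defined in the paper.

```python
# Minimal DFA over tool calls: a trace is "path-correct" only if accepted.
# States, tools, and transitions here are hypothetical examples.
TRANSITIONS = {
    ("start", "search"): "found",
    ("found", "search"): "found",   # repeated search stays valid in this toy task
    ("found", "open"): "opened",
    ("opened", "submit"): "done",
}
ACCEPTING = {"done"}

def accepts(trace):
    """Return True iff the sequence of tool calls is a valid path of the task DFA."""
    state = "start"
    for tool in trace:
        state = TRANSITIONS.get((state, tool))
        if state is None:           # illegal tool call from the current state
            return False
    return state in ACCEPTING

print(accepts(["search", "open", "submit"]))  # True: a complete valid path
print(accepts(["open", "submit"]))            # False: skipped the required prefix
```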
arXiv Detail & Related papers (2025-09-25T10:49:35Z)
- FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks [52.47895046206854]
FieldWorkArena is a benchmark for agentic AI targeting real-world field work. This paper defines a new action space that agentic AI should possess for real-world work-environment benchmarks.
arXiv Detail & Related papers (2025-05-26T08:21:46Z)
- On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems, which autonomously plan and act, are becoming widespread, yet their task success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly incorporates feedback extracted from different forms of critiques.
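A minimal sketch of that sample-evaluate-feedback loop follows; the real components of IAD are model-based, so the two functions below are purely illustrative stand-ins for an agent and its critic.

```python
# Toy stand-ins for an LLM agent and its critic (illustrative assumptions).
def generate(task, feedback):
    draft = task.upper() if "uppercase" in feedback else task
    return draft + "!" if "excited" in feedback else draft

def critique(task, draft):
    """Return (score, feedback) from a hand-coded verifier, not a model."""
    notes = []
    if not draft.isupper():
        notes.append("uppercase")
    if "!" not in draft:
        notes.append("excited")
    return 2 - len(notes), " ".join(notes)

def iterative_agent_decoding(task, rounds=3):
    """Sample -> evaluate -> feed the critique back into the next sample."""
    feedback, best, best_score = "", None, -1
    for _ in range(rounds):
        draft = generate(task, feedback)
        score, feedback = critique(task, draft)
        if score > best_score:
            best, best_score = draft, score
    return best

print(iterative_agent_decoding("hello world"))  # -> "HELLO WORLD!"
```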
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
- Learning when to rank: Estimation of partial rankings from sparse, noisy comparisons [0.0]
We develop a principled nonparametric Bayesian method for learning partial rankings (rankings with ties). We examine the performance of our method on a variety of real and synthetic network datasets.
arXiv Detail & Related papers (2025-01-05T11:04:30Z)
- SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation [75.56845750400116]
Disaggregated evaluation -- estimating the performance of a machine learning model on different subpopulations -- is a core task when assessing the performance and group fairness of AI systems.
We develop SureMap, which achieves high estimation accuracy for both multi-task and single-task disaggregated evaluations of black-box models.
Our method combines maximum a posteriori (MAP) estimation under a well-chosen prior with cross-validation-free tuning via Stein's unbiased risk estimate (SURE).
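The snippet below isolates only the SURE ingredient: choosing a shrinkage strength toward a prior mean without any held-out data, for Gaussian observations with a known noise level. The data, the scalar-shrinkage family, and the grid search are assumptions; SureMap's actual multi-task estimator is richer.

```python
import random

random.seed(1)

# Simulated subgroup accuracies: true means observed with Gaussian noise.
n, sigma, prior_mean = 20, 0.05, 0.7
true = [random.uniform(0.6, 0.8) for _ in range(n)]
obs = [t + random.gauss(0.0, sigma) for t in true]
sq = sum((x - prior_mean) ** 2 for x in obs)

def sure(lam):
    """Stein's unbiased risk estimate for mu_hat = (1 - lam) * x + lam * m."""
    return -n * sigma**2 + lam**2 * sq + 2 * sigma**2 * n * (1 - lam)

# Tune lam by minimizing SURE on a grid -- no cross-validation needed.
lam = min((k / 100 for k in range(101)), key=sure)
shrunk = [(1 - lam) * x + lam * prior_mean for x in obs]

mse = lambda est: sum((e - t) ** 2 for e, t in zip(est, true)) / n
print(f"lam={lam:.2f}  raw MSE={mse(obs):.5f}  shrunk MSE={mse(shrunk):.5f}")
```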
arXiv Detail & Related papers (2024-11-14T17:53:35Z)
- Discordance Minimization-based Imputation Algorithms for Missing Values in Rating Data [4.100928307172084]
When multiple rating lists are combined or considered together, subjects often have missing ratings.
We analyze missing-value patterns in six real-world data sets from various applications.
We propose optimization models and algorithms that minimize the total rating discordance across rating providers.
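As a toy instance of that objective, the sketch below fills one missing rating with the value that minimizes the number of discordant subject pairs against the other raters; the rating matrix and the 1-5 scale are invented, and the paper's optimization models are more general.

```python
from itertools import combinations

# Rows are raters, columns are subjects; None marks a missing rating (toy data).
ratings = [
    [5, 3, 4, 1],
    [4, 2, 5, 1],
    [5, None, 4, 2],   # rater 2 did not rate subject 1
]

def discordance(a, b):
    """Count subject pairs that two raters order in opposite directions."""
    d = 0
    for i, j in combinations(range(len(a)), 2):
        if None in (a[i], a[j], b[i], b[j]):
            continue
        if (a[i] - a[j]) * (b[i] - b[j]) < 0:
            d += 1
    return d

def impute(rows, r, c, scale=range(1, 6)):
    """Fill rows[r][c] with the rating minimizing total discordance."""
    def total(v):
        trial = [row[:] for row in rows]
        trial[r][c] = v
        return sum(discordance(trial[r], other)
                   for k, other in enumerate(trial) if k != r)
    return min(scale, key=total)

print(impute(ratings, 2, 1))  # -> 2 (values 2-4 tie; min keeps the first)
```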
arXiv Detail & Related papers (2023-11-07T14:42:06Z)
- Heuristic Search for Rank Aggregation with Application to Label Ranking [16.275063634853584]
We propose an effective hybrid evolutionary ranking algorithm to solve the rank aggregation problem.
The algorithm features a semantic crossover based on concordant pairs and a late acceptance local search reinforced by an efficient incremental evaluation technique.
Experiments conducted to assess the algorithm indicate highly competitive performance on benchmark instances.
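The late-acceptance component can be sketched in isolation (the semantic crossover is omitted): minimize the summed Kendall tau distance to a set of toy votes with a swap neighbourhood, accepting a move if it beats either the current cost or the cost from a fixed number of steps earlier. The instance size and vote data are assumptions.

```python
import random
from itertools import combinations

random.seed(3)

# Toy instance: 7 voters each rank 6 items, best first (invented data).
ITEMS = 6
votes = [random.sample(range(ITEMS), ITEMS) for _ in range(7)]

def cost(r):
    """Total Kendall tau distance from ranking r to all votes."""
    total = 0
    for v in votes:
        pos = {item: i for i, item in enumerate(v)}
        total += sum(1 for a, b in combinations(r, 2) if pos[a] > pos[b])
    return total

def lahc(steps=5000, history=50):
    """Late-acceptance hill climbing: accept a move that beats the current
    cost or the cost recorded `history` iterations ago."""
    cur = random.sample(range(ITEMS), ITEMS)
    cur_cost = cost(cur)
    hist = [cur_cost] * history
    best, best_cost = cur[:], cur_cost
    for t in range(steps):
        cand = cur[:]
        i, j = random.sample(range(ITEMS), 2)   # neighbourhood: swap two items
        cand[i], cand[j] = cand[j], cand[i]
        c = cost(cand)
        if c <= cur_cost or c <= hist[t % history]:
            cur, cur_cost = cand, c
        hist[t % history] = cur_cost
        if cur_cost < best_cost:
            best, best_cost = cur[:], cur_cost
    return best, best_cost

print(lahc())
```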
arXiv Detail & Related papers (2022-01-11T11:43:17Z)
- Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
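A stripped-down sketch of the elimination idea follows: keep sampling a pair until a Hoeffding-style interval separates its empirical win rate from 1/2, then stop spending samples on it. The item strengths, the logistic noise model, and the confidence radius are assumptions; the paper additionally models heterogeneous comparer accuracies and proves high-probability guarantees.

```python
import math
import random

random.seed(4)

ITEMS = 5
strength = list(range(ITEMS))                  # item 4 is truly best (assumed)

def compare(i, j):
    """Noisy comparison: True if i is judged better than j (logistic model)."""
    return random.random() < 1.0 / (1.0 + math.exp(strength[j] - strength[i]))

wins, n = {}, {}
active = {(i, j) for i in range(ITEMS) for j in range(i + 1, ITEMS)}
t = 0
while active and t < 20000:
    t += 1
    pair = random.choice(sorted(active))       # sample an undecided pair
    i, j = pair
    n[pair] = n.get(pair, 0) + 1
    wins[pair] = wins.get(pair, 0) + compare(i, j)
    p_hat = wins[pair] / n[pair]
    radius = math.sqrt(math.log(2.0 * t * t) / (2.0 * n[pair]))
    if abs(p_hat - 0.5) > radius:              # order decided: eliminate the pair
        active.discard(pair)

# Copeland-style ranking from the decided pairwise orders.
score = [0] * ITEMS
for (i, j), w in wins.items():
    score[i if w / n[(i, j)] > 0.5 else j] += 1
print("ranking:", sorted(range(ITEMS), key=lambda k: -score[k]))
```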
arXiv Detail & Related papers (2021-10-08T13:51:55Z)
- Poisoning Attack against Estimating from Pairwise Comparisons [140.9033911097995]
Attackers have strong motivation and incentives to manipulate the ranking list.
Data poisoning attacks on pairwise ranking algorithms can be formalized as dynamic and static games between the ranker and the attacker.
We propose two efficient poisoning attack algorithms and establish the associated theoretical guarantees.
arXiv Detail & Related papers (2021-07-05T08:16:01Z)
- Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking [74.46448041224247]
We introduce the novel Logging-Policy Optimization Algorithm (LogOpt), which optimizes the policy used for logging data.
LogOpt turns the counterfactual approach -- which is indifferent to the logging policy -- into an online approach, where the algorithm decides what rankings to display.
We prove that, as an online evaluation method, LogOpt is unbiased w.r.t. position and item-selection bias, unlike existing interleaving methods.
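The counterfactual ingredient can be illustrated with a position-based inverse-propensity estimate: correct logged clicks for position bias, then score rankings that were never actually shown. LogOpt goes further by also choosing the logging policy; below the logger is uniformly random, the examination probabilities are assumed known, and all numbers are invented.

```python
import random

random.seed(5)

# Toy click model: 4 documents, position-biased examination (invented numbers).
relevance = [0.9, 0.2, 0.6, 0.1]      # hidden from the estimator
examine = [1.0, 0.6, 0.3, 0.1]        # P(user looks at rank k), assumed known

def log_interaction():
    ranking = random.sample(range(4), 4)           # uniformly random logger
    clicks = [random.random() < examine[pos] * relevance[doc]
              for pos, doc in enumerate(ranking)]
    return ranking, clicks

# Inverse-propensity-weighted relevance estimates from logged clicks only.
rel_hat, cnt = [0.0] * 4, [0] * 4
for _ in range(20000):
    ranking, clicks = log_interaction()
    for pos, doc in enumerate(ranking):
        cnt[doc] += 1
        rel_hat[doc] += clicks[pos] / examine[pos]  # corrects position bias
rel_hat = [s / c for s, c in zip(rel_hat, cnt)]

def value(ranking):
    """Counterfactual expected clicks of a ranking never actually shown."""
    return sum(examine[pos] * rel_hat[doc] for pos, doc in enumerate(ranking))

best = sorted(range(4), key=lambda d: -rel_hat[d])
print("estimated relevance:", [round(r, 2) for r in rel_hat])
print("value of", best, "=", round(value(best), 3))
```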
arXiv Detail & Related papers (2020-07-24T18:05:58Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing ranking fairness and utility in the bipartite ranking scenario.
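The simplest instance of such post-processing is a per-group cutoff that equalizes selection rates; the synthetic scores, the deliberate group shift, and the median rule below are all assumptions, and the paper's framework trades fairness against utility more carefully than this sketch does.

```python
import random

random.seed(6)

# Toy population: (score, group, label); group-1 scores carry an artificial
# +0.5 shift, the disparity the post-processing then removes.
data = []
for _ in range(2000):
    g = random.randrange(2)
    label = random.random() < 0.5
    score = random.gauss(1.0 if label else 0.0, 1.0) + 0.5 * g
    data.append((score, g, label))

groups = {g: [d for d in data if d[1] == g] for g in (0, 1)}

def median_score(rows):
    return sorted(s for s, _, _ in rows)[len(rows) // 2]

m = median_score(data)
single = {0: m, 1: m}                                   # one global cutoff
fair = {g: median_score(r) for g, r in groups.items()}  # equal selection rates

def report(name, thr):
    picked = [(g, lab) for s, g, lab in data if s >= thr[g]]
    rates = [sum(1 for gg, _ in picked if gg == g) / len(groups[g]) for g in (0, 1)]
    precision = sum(lab for _, lab in picked) / len(picked)
    print(f"{name}: rates={rates[0]:.2f}/{rates[1]:.2f} precision={precision:.2f}")

report("single threshold", single)   # unequal selection rates across groups
report("per-group cutoff", fair)     # rates equalized, some precision traded
```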
arXiv Detail & Related papers (2020-06-15T10:08:39Z)