Related papers: Towards More Standardized AI Evaluation: From Models to Agents

Towards More Standardized AI Evaluation: From Models to Agents

URL: http://arxiv.org/abs/2602.18029v1
Date: Fri, 20 Feb 2026 06:54:44 GMT
Title: Towards More Standardized AI Evaluation: From Models to Agents
Authors: Ali El Filali, Inès Bedar,
Abstract summary: As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function.<n>Most evaluation practices remain anchored in assumptions inherited from the model-centric era.<n>This paper argues that such approaches are increasingly obscure rather than illuminating system behavior.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluation is no longer a final checkpoint in the machine learning lifecycle. As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. The question is no longer "How good is the model?" but "Can we trust the system to behave as intended, under change, at scale?". Yet most evaluation practices remain anchored in assumptions inherited from the model-centric era: static benchmarks, aggregate scores, and one-off success criteria. This paper argues that such approaches are increasingly obscure rather than illuminating system behavior. We examine how evaluation pipelines themselves introduce silent failure modes, why high benchmark scores routinely mislead teams, and how agentic systems fundamentally alter the meaning of performance measurement. Rather than proposing new metrics or harder benchmarks, we aim to clarify the role of evaluation in the AI era, and especially for agents: not as performance theater, but as a measurement discipline that conditions trust, iteration, and governance in non-deterministic systems.

Related papers

The Necessity of a Unified Framework for LLM-Based Agent Evaluation [46.631678638677386]
General-purpose agents have seen fundamental advancements.<n> evaluating these agents presents unique challenges that distinguish them from static QA benchmarks.<n>We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation.
arXiv Detail & Related papers (2026-02-03T08:18:37Z)
Position: All Current Generative Fidelity and Diversity Metrics are Flawed [58.815519650465774]
We show that all current generative fidelity and diversity metrics are flawed.<n>Our aim is to convince the research community to spend more effort in developing metrics, instead of models.
arXiv Detail & Related papers (2025-05-28T15:10:33Z)
Large Language Models Often Know When They Are Being Evaluated [0.015534429177540245]
We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment.<n>We construct a benchmark of 1,000 prompts and transcripts from 61 distinct datasets.<n>Our results indicate that frontier models already exhibit a substantial, though not yet, level of evaluation-awareness.
arXiv Detail & Related papers (2025-05-28T12:03:09Z)
Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems [24.81155882432305]
We show that when an advanced AI system under evaluation is more advanced in reasoning and situational awareness, the evaluation faking behavior becomes more ubiquitous.<n>We devised a chain-of-thought monitoring technique to detect faking intent and uncover internal signals correlated with such behavior.
arXiv Detail & Related papers (2025-05-23T12:31:29Z)
Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment. To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
Evaluating Machine Unlearning via Epistemic Uncertainty [78.27542864367821]
This work presents an evaluation of Machine Unlearning algorithms based on uncertainty. This is the first definition of a general evaluation of our best knowledge.
arXiv Detail & Related papers (2022-08-23T09:37:31Z)
Differential Assessment of Black-Box AI Agents [29.98710357871698]
We propose a novel approach to differentially assess black-box AI agents that have drifted from their previously known models. We leverage sparse observations of the drifted agent's current behavior and knowledge of its initial model to generate an active querying policy. Empirical evaluation shows that our approach is much more efficient than re-learning the agent model from scratch.
arXiv Detail & Related papers (2022-03-24T17:48:58Z)
Offline Contextual Bandits with Overparameterized Models [52.788628474552276]
We ask whether the same phenomenon occurs for offline contextual bandits. We show that this discrepancy is due to the emphaction-stability of their objectives. In experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
arXiv Detail & Related papers (2020-06-27T13:52:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.