Towards a Science of AI Agent Reliability
- URL: http://arxiv.org/abs/2602.16666v2
- Date: Mon, 23 Feb 2026 18:49:07 GMT
- Title: Towards a Science of AI Agent Reliability
- Authors: Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan,
- Abstract summary: AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. We propose twelve metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety.
- Score: 9.570634569436535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
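The abstract does not reproduce the twelve metric definitions, but as a purely illustrative sketch, each of the four dimensions can be estimated from repeated runs of the same agent. In the Python below, the record schema, function names, and formulas are our own assumptions for illustration, not the paper's definitions.

```python
from itertools import combinations
from statistics import mean

# Hypothetical per-run records for one task: success flag, self-reported
# confidence in [0, 1], whether the input was perturbed, and a failure
# severity cost (0 when the run succeeded). None of this schema is from
# the paper; it only shows each dimension is separately measurable.
runs = [
    {"success": True,  "confidence": 0.9, "perturbed": False, "severity": 0.0},
    {"success": False, "confidence": 0.8, "perturbed": False, "severity": 2.0},
    {"success": True,  "confidence": 0.7, "perturbed": True,  "severity": 0.0},
    {"success": False, "confidence": 0.4, "perturbed": True,  "severity": 5.0},
]

def consistency(rs):
    """Fraction of run pairs that agree on success/failure (illustrative)."""
    return mean(a["success"] == b["success"] for a, b in combinations(rs, 2))

def robustness(rs):
    """Success rate under perturbation minus success rate on clean inputs."""
    clean = mean(r["success"] for r in rs if not r["perturbed"])
    pert = mean(r["success"] for r in rs if r["perturbed"])
    return pert - clean

def predictability(rs):
    """Mean absolute gap between stated confidence and actual outcome."""
    return mean(abs(r["confidence"] - r["success"]) for r in rs)

def safety(rs):
    """Worst-case failure severity observed across runs."""
    return max(r["severity"] for r in rs)

for metric in (consistency, robustness, predictability, safety):
    print(metric.__name__, round(metric(runs), 3))
```

In this toy profile the agent succeeds on half its runs, yet the four numbers separately expose run-to-run disagreement, the perturbation gap, poor confidence calibration, and a worst-case severity of 5.0, exactly the kind of operational detail a single success score hides.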
Related papers
- Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation [2.102846336724103]
Procedure-Aware Evaluation (PAE) is a framework that formalizes agent procedures as structured observations. We evaluate state-of-the-art Large Language Model (LLM)-based agents on tau-bench.
arXiv Detail & Related papers (2026-03-03T15:47:41Z)
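The entry above does not detail PAE's formalism, but its notion of corrupt success, a run that reaches the right outcome by the wrong procedure, can be illustrated with a toy subsequence check. The step names and reference procedure below are invented for the example.

```python
# Reference procedure for a hypothetical refund task (invented names).
REFERENCE = ["authenticate", "look_up_order", "confirm_with_user", "issue_refund"]

def is_corrupt_success(actions: list[str], succeeded: bool) -> bool:
    """Flag runs that succeed while skipping or reordering required steps."""
    remaining = iter(actions)
    # `step in remaining` consumes the iterator, so this checks that
    # REFERENCE occurs as an ordered subsequence of the action trace.
    followed = all(step in remaining for step in REFERENCE)
    return succeeded and not followed

# The agent refunds without confirming with the user: an outcome-only
# evaluation scores this run as a pass; a procedure-aware one flags it.
trace = ["authenticate", "look_up_order", "issue_refund"]
print(is_corrupt_success(trace, succeeded=True))  # True
```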
- TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents [51.30998248590416]
Trajectory-Aware Comprehensive Evaluation (TRACE) is a framework that holistically assesses the entire problem-solving trajectory. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity.
arXiv Detail & Related papers (2026-02-05T13:28:57Z)
- Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory Calibration (HTC) is a novel diagnostic framework for AI agents. HTC consistently surpasses strong baselines in both calibration and discrimination. HTC provides interpretability by revealing the signals behind failure.
arXiv Detail & Related papers (2026-01-22T09:08:25Z)
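HTC's own signals are not described in this summary. As background, the standard quantity such calibration work targets is expected calibration error (ECE), sketched below over trajectory-level confidences; the binning scheme and toy data are our own, not HTC.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Standard ECE: bin predictions by stated confidence, then take the
    weighted mean gap between each bin's average confidence and its
    empirical success rate. Background only, not HTC itself."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy agent trajectories: stated confidence vs. eventual task success.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 0]))  # 0.4
```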
- Agentic Uncertainty Quantification [76.94013626702183]
We propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary.
arXiv Detail & Related papers (2026-01-22T07:16:26Z)
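A minimal control-flow sketch of this dual-process loop, with placeholder act/reflect functions and an assumed confidence threshold standing in for the paper's UAM and UAR mechanisms:

```python
THRESHOLD = 0.6  # assumed cutoff; the paper's actual trigger rule is not given here

def act(step, memory):
    # Placeholder for an LLM call returning an answer, a verbalized
    # confidence in [0, 1], and an explanation of that confidence.
    return f"draft answer for {step}", 0.4, "unsure which tool applies"

def reflect(step, memory, explanation):
    # Placeholder for targeted re-reasoning, invoked only when needed.
    return f"revised answer for {step}", 0.8

def run_step(step, memory):
    answer, confidence, explanation = act(step, memory)
    if confidence < THRESHOLD:            # System 2: selective reflection
        answer, confidence = reflect(step, memory, explanation)
    memory.append({"step": step, "confidence": confidence,
                   "explanation": explanation})  # System 1: carry forward
    return answer

memory = []
print(run_step("book the flight", memory))
print(memory)
```

The gating mirrors the last clause of the summary: reflection fires only when the propagated confidence signals it is necessary.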
- A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents [4.851169906977996]
We introduce a new benchmark comprising 40 distinct scenarios. Each scenario presents a task that requires multi-step actions, and the agent's performance is tied to a specific Key Performance Indicator (KPI). We observe outcome-driven constraint violations ranging from 1.3% to 71.4%, with 9 of the 12 models exhibiting misalignment rates between 30% and 50%.
arXiv Detail & Related papers (2025-12-23T21:52:53Z)
- Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety [2.7030665672026846]
Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence.
arXiv Detail & Related papers (2025-10-18T13:22:19Z)
- Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA is complemented by a set of metrics designed to go beyond point-in-time performance. The fragility of SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z)
- Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
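The definition in this entry is concrete enough for a direct Monte-Carlo estimator. The sketch below, including the toy environment and the rollout interface, is our own construction from that definition, not the paper's code.

```python
import random

def true_criticality(rollout, state, policy, actions, n, trials=500):
    """Expected total-reward drop when the agent takes n consecutive
    random actions from `state` before resuming its policy (per the
    definition quoted above; the rollout interface is assumed)."""
    def deviating(step, s):
        return random.choice(actions) if step < n else policy(s)
    baseline = rollout(state, lambda step, s: policy(s))
    deviated = sum(rollout(state, deviating) for _ in range(trials)) / trials
    return baseline - deviated

# Toy 1-D environment: reward 1 for reaching position +3 within 5 steps.
def rollout(start, act_fn):
    pos = start
    for step in range(5):
        pos += act_fn(step, pos)  # actions are -1 or +1
        if pos >= 3:
            return 1
    return 0

policy = lambda s: 1  # always step right; reaches +3 from 0 in 3 steps
print(true_criticality(rollout, 0, policy, actions=[-1, 1], n=3))
```

Expected output is about 0.5: half of the random three-step prefixes leave the agent unable to recover within the horizon, so deviating at the start of this toy episode is moderately critical.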
- AI Agents That Matter [11.794931453828974]
AI agents are an exciting new research direction, and agent development is driven by benchmarks.
There is a narrow focus on accuracy without attention to other metrics.
The benchmarking needs of model and downstream developers have been conflated.
Many agent benchmarks have inadequate holdout sets, and sometimes none at all.
arXiv Detail & Related papers (2024-07-01T17:48:14Z)
- How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors.
Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z)