Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems
- URL: http://arxiv.org/abs/2511.14136v1
- Date: Tue, 18 Nov 2025 04:50:19 GMT
- Title: Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems
- Authors: Sushant Mehta
- Abstract summary: Current agentic AI benchmarks predominantly evaluate task completion accuracy. The lack of cost-controlled evaluation leads to 50x cost variations for similar precision, and reliability assessment is inadequate: agent performance drops from 60% (single run) to 25% (8-run consistency).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current agentic AI benchmarks predominantly evaluate task completion accuracy, while overlooking critical enterprise requirements such as cost-efficiency, reliability, and operational stability. Through systematic analysis of 12 main benchmarks and empirical evaluation of state-of-the-art agents, we identify three fundamental limitations: (1) absence of cost-controlled evaluation leading to 50x cost variations for similar precision, (2) inadequate reliability assessment where agent performance drops from 60% (single run) to 25% (8-run consistency), and (3) missing multidimensional metrics for security, latency, and policy compliance. We propose **CLEAR** (Cost, Latency, Efficacy, Assurance, Reliability), a holistic evaluation framework specifically designed for enterprise deployment. Evaluation of six leading agents on 300 enterprise tasks demonstrates that optimizing for accuracy alone yields agents 4.4-10.8x more expensive than cost-aware alternatives with comparable performance. Expert evaluation (N=15) confirms that CLEAR better predicts production success (correlation ρ = 0.83) compared to accuracy-only evaluation (ρ = 0.41).
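To make the reliability and cost findings concrete, here is a minimal sketch of single-run accuracy, strict k-run consistency, and cost per solved task computed from evaluation logs; the record layout and the pass^k-style aggregation are illustrative assumptions, not the paper's CLEAR implementation.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    """One agent attempt at one task (hypothetical record layout)."""
    task_id: str
    success: bool
    cost_usd: float
    latency_s: float

def single_run_accuracy(runs: dict[str, list[RunResult]]) -> float:
    """Fraction of tasks solved on the first attempt."""
    return sum(task_runs[0].success for task_runs in runs.values()) / len(runs)

def k_run_consistency(runs: dict[str, list[RunResult]], k: int = 8) -> float:
    """Fraction of tasks solved on all k attempts (a strict pass^k-style
    notion, under which a ~60% single-run score can collapse to ~25%)."""
    return sum(all(r.success for r in task_runs[:k])
               for task_runs in runs.values()) / len(runs)

def cost_per_solved_task(runs: dict[str, list[RunResult]]) -> float:
    """Cost-controlled view: total spend divided by tasks solved at least once."""
    total_cost = sum(r.cost_usd for task_runs in runs.values() for r in task_runs)
    solved = sum(any(r.success for r in task_runs) for task_runs in runs.values())
    return total_cost / max(solved, 1)
```

Comparing agents on cost per solved task rather than raw accuracy is the kind of cost-controlled view under which the reported 4.4-10.8x cost gap becomes visible.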
Related papers
- Towards a Science of AI Agent Reliability [9.570634569436535]
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. We propose twelve metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety.
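As one concrete illustration of metrics along these dimensions, the sketch below gives a plausible instantiation for robustness and predictability; both definitions are assumptions for illustration, not the paper's twelve metrics.

```python
import statistics

def robustness_gap(clean_accuracy: float, perturbed_accuracies: list[float]) -> float:
    """Drop from clean accuracy to worst-case accuracy under input
    perturbations (one plausible robustness metric)."""
    return clean_accuracy - min(perturbed_accuracies)

def predictability(run_accuracies: list[float]) -> float:
    """Spread of accuracy across repeated runs; lower means more
    predictable behavior (one plausible predictability metric)."""
    return statistics.pstdev(run_accuracies)
```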
arXiv Detail & Related papers (2026-02-18T18:05:44Z)
- When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems [1.8717456484053328]
Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation. We introduce a comprehensive diagnostic framework that leverages big data analytics to evaluate procedural reliability in intelligent agent systems. This work establishes foundational infrastructure for systematic reliability evaluation of tool-augmented AI systems.
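One way to picture the procedural-reliability bookkeeping such a framework implies is a failure-mode tally over tool-call traces; the taxonomy and trace format below are illustrative assumptions, not the paper's diagnostic categories.

```python
from collections import Counter

def diagnose_tool_failures(traces: list[tuple[str, str | None, bool]]) -> Counter:
    """Bucket tool-invocation outcomes into coarse failure modes.

    Each trace is (expected_tool, observed_tool, args_valid), where
    observed_tool is None when the agent never issued a call at all.
    """
    buckets = Counter()
    for expected, observed, args_valid in traces:
        if observed is None:
            buckets["no_invocation"] += 1        # the agent failed to act
        elif observed != expected:
            buckets["wrong_tool"] += 1
        elif not args_valid:
            buckets["malformed_arguments"] += 1
        else:
            buckets["correct"] += 1
    return buckets
```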
arXiv Detail & Related papers (2026-01-22T19:24:21Z)
- A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents [4.851169906977996]
We introduce a new benchmark comprising 40 distinct scenarios. Each scenario presents a task that requires multi-step actions, and the agent's performance is tied to a specific Key Performance Indicator (KPI). We observe outcome-driven constraint violations ranging from 1.3% to 71.4%, with 9 of the 12 models exhibiting misalignment rates between 30% and 50%.
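Given harness output, the headline misalignment number reduces to a simple rate; the episode schema below is a hypothetical stand-in for whatever the benchmark actually records.

```python
def misalignment_rate(episodes: list[dict]) -> float:
    """Share of episodes in which the agent met its KPI by violating a
    constraint. Each episode is assumed to carry 'kpi_met' and
    'constraint_violated' booleans from the harness (hypothetical schema).
    """
    violations = sum(e["kpi_met"] and e["constraint_violated"] for e in episodes)
    return violations / len(episodes)
```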
arXiv Detail & Related papers (2025-12-23T21:52:53Z)
- Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a cross-validated predictive model from coordination metrics, enabling prediction on unseen task domains. We identify three effects, including (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability-saturation effect: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
- Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with an Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency, increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7x compared to strong prompting and uncertainty-based baselines.
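The EVPI objective has a compact core: ask a clarification question only when the expected payoff of knowing the answer exceeds the payoff of acting under the current belief by more than the question's cost. The sketch below shows that quantity for a single uncertain tool argument with an assumed 0/1 payoff; the prior, payoff matrix, and cost are illustrative, not the paper's model.

```python
def evpi(prior: dict[str, float], action_value: dict[tuple[str, str], float]) -> float:
    """Expected Value of Perfect Information for one uncertain tool argument.

    prior: belief over candidate argument values.
    action_value[(action, true_value)]: payoff of committing to `action`
    when the user actually meant `true_value`.
    """
    actions = {a for a, _ in action_value}
    # Best expected payoff if we act now, under the current belief.
    act_now = max(sum(prior[v] * action_value[(a, v)] for v in prior)
                  for a in actions)
    # Expected payoff if a clarification reveals the true value first.
    act_informed = sum(prior[v] * max(action_value[(a, v)] for a in actions)
                       for v in prior)
    return act_informed - act_now

# Ask only when the information is worth more than the question costs:
prior = {"report.pdf": 0.55, "report_v2.pdf": 0.45}             # assumed belief
payoff = {(a, v): float(a == v) for a in prior for v in prior}  # 0/1 payoff
question_cost = 0.2                                             # assumed cost
should_ask = evpi(prior, payoff) > question_cost                # 0.45 > 0.2 -> True
```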
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
- Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations [0.0]
We present the Kamiwaza Agentic Merit Index (KAMI) v0.1, an enterprise-focused benchmark that addresses both contamination resistance and agentic evaluation. We demonstrate that traditional benchmark rankings poorly predict practical agentic performance. We also present insights on cost-performance tradeoffs, model-specific behavioral patterns, and the impact of reasoning capabilities on token efficiency.
arXiv Detail & Related papers (2025-11-11T09:46:25Z)
- CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent [46.41047559759938]
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. We present CUARewardBench, comprising four key contributions.
arXiv Detail & Related papers (2025-10-21T12:53:40Z)
- Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z)
- AI Agents-as-Judge: Automated Assessment of Accuracy, Consistency, Completeness and Clarity for Enterprise Documents [0.0]
This study presents a modular, multi-agent system for the automated review of highly structured enterprise business documents using AI agents. It uses modern orchestration tools such as LangChain, CrewAI, TruLens, and Guidance to enable section-by-section evaluation of documents. It achieves 99% information consistency (vs. 92% for humans), halving error and bias rates, and reducing average review time from 30 to 2.5 minutes per document.
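A section-by-section review loop of this shape can be expressed in a few lines of plain Python; the `judge` callable and dimension names below are illustrative stand-ins rather than the paper's LangChain/CrewAI pipeline.

```python
REVIEW_DIMENSIONS = ["accuracy", "consistency", "completeness", "clarity"]

def review_document(sections: dict[str, str], judge) -> dict[str, dict[str, float]]:
    """Score every section on every dimension.

    judge(dimension, section_text) -> float in [0, 1]; assumed to wrap an
    LLM call behind whatever orchestration layer is in use.
    """
    return {name: {dim: judge(dim, text) for dim in REVIEW_DIMENSIONS}
            for name, text in sections.items()}
```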
arXiv Detail & Related papers (2025-06-23T17:46:15Z)
- Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding [5.861057085203687]
We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of integrating cross-modal inputs. Our approach establishes quantitative relationships between technical metrics and user-centric trust measures. This work advances responsible AI deployment by providing a rigorous framework for quantifying and enhancing trustworthiness in multimodal RAG for critical enterprise applications.
arXiv Detail & Related papers (2025-06-19T18:05:00Z)
- Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA is complemented by a set of metrics designed to go beyond point-in-time performance. The fragility of SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
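Confidence quality can be probed with standard calibration measures; the sketch below computes expected calibration error as one such stand-in, without claiming it is among AURORA's own metrics.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Gap between stated confidence and observed accuracy, averaged over
    confidence bins (a standard probe of confidence quality)."""
    conf = np.asarray(confidences, dtype=float)
    acc = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    edges[0] = -1e-12  # make the first bin include confidence == 0.0
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # weight each bin's |avg confidence - accuracy| by its population
            ece += mask.mean() * abs(conf[mask].mean() - acc[mask].mean())
    return float(ece)
```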
arXiv Detail & Related papers (2025-05-28T20:22:43Z)
- Beyond Confidence: Adaptive Abstention in Dual-Threshold Conformal Prediction for Autonomous System Perception [0.4124847249415279]
Safety-critical perception systems require reliable uncertainty quantification and principled abstention mechanisms to maintain safety. We present a novel dual-threshold conformalization framework that provides statistically-guaranteed uncertainty estimates while enabling selective prediction in high-risk scenarios.
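A minimal sketch of the general idea, assuming standard split-conformal calibration plus a second, set-size threshold that triggers abstention; the paper's actual thresholds and guarantees differ in detail.

```python
import numpy as np

def calibrate_threshold(cal_scores: np.ndarray, alpha: float) -> float:
    """Split-conformal quantile over calibration nonconformity scores
    (e.g., 1 - probability assigned to the true class)."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, q, method="higher"))

def predict_with_abstention(probs: np.ndarray, tau: float, max_set_size: int):
    """Dual-threshold decision: build a conformal prediction set, then
    abstain when the set is empty or too large to act on safely."""
    pred_set = [i for i, p in enumerate(probs) if 1.0 - p <= tau]
    if not pred_set or len(pred_set) > max_set_size:
        return None  # abstain: defer to a fallback system or a human
    return pred_set
```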
arXiv Detail & Related papers (2025-02-11T04:45:31Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios, and absolute error rates of up to 19% in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Models (LLMs).
We propose a decoding algorithm integrating the self-evaluation guidance via beam search.
Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by 6.34%, 9.56%, and 5.46% on GSM8K, AQuA, and StrategyQA, respectively.
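The combination of generation likelihood and stepwise self-evaluation can be made concrete with a small sketch; `propose_steps` and `self_evaluate` below are hypothetical stubs standing in for LLM calls, and the additive log-linear scoring is an illustrative choice rather than the paper's exact formulation.

```python
import math

def guided_beam_search(question: str, propose_steps, self_evaluate,
                       beam_width: int = 4, max_depth: int = 6, lam: float = 0.5):
    """Beam search over reasoning chains, scored by generation log-prob
    combined with a stepwise self-evaluation signal.

    propose_steps(question, chain) -> list of (step_text, logprob) candidates.
    self_evaluate(question, chain, step_text) -> confidence in (0, 1].
    `lam` weights the self-evaluation term against the generation term.
    """
    beams = [([], 0.0)]  # (chain of reasoning steps, cumulative score)
    for _ in range(max_depth):
        candidates = []
        for chain, score in beams:
            for step, logprob in propose_steps(question, chain):
                conf = self_evaluate(question, chain, step)
                step_score = (1 - lam) * logprob + lam * math.log(conf)
                candidates.append((chain + [step], score + step_score))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # highest-scoring reasoning chain
```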
arXiv Detail & Related papers (2023-05-01T02:37:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.