Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations
- URL: http://arxiv.org/abs/2511.08042v1
- Date: Wed, 12 Nov 2025 01:36:26 GMT
- Title: Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations
- Authors: JV Roig
- Abstract summary: We present the Kamiwaza Agentic Merit Index (KAMI) v0.1, an enterprise-focused benchmark that addresses both contamination resistance and agentic evaluation. We demonstrate that traditional benchmark rankings poorly predict practical agentic performance. We also present insights on cost-performance tradeoffs, model-specific behavioral patterns, and the impact of reasoning capabilities on token efficiency.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Enterprise adoption of agentic AI systems requires reliable evaluation methods that reflect real-world deployment scenarios. Traditional LLM benchmarks suffer from training data contamination and fail to measure agentic capabilities such as multi-step tool use and decision-making under uncertainty. We present the Kamiwaza Agentic Merit Index (KAMI) v0.1, an enterprise-focused benchmark that addresses both contamination resistance and agentic evaluation. Through 170,000 LLM test items processing over 5.5 billion tokens across 35 model configurations, we demonstrate that traditional benchmark rankings poorly predict practical agentic performance. Notably, newer generation models like Llama 4 or Qwen 3 do not always outperform their older generation variants on enterprise-relevant tasks, contradicting traditional benchmark trends. We also present insights on cost-performance tradeoffs, model-specific behavioral patterns, and the impact of reasoning capabilities on token efficiency -- findings critical for enterprises making deployment decisions.
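The abstract cites cost-performance tradeoffs and the effect of reasoning on token efficiency without spelling out the arithmetic here. The Python sketch below is a minimal illustration of how such comparisons can be made, assuming a simple per-token pricing model; the model names, prices, and counts are hypothetical and do not reflect KAMI's actual data or methodology.

```python
# Illustrative sketch only: comparing models on cost per solved task and token
# efficiency, the kind of cost-performance tradeoff the abstract discusses.
# All model names, prices, and counts are hypothetical assumptions.
from dataclasses import dataclass

@dataclass
class RunStats:
    model: str
    tasks_attempted: int
    tasks_solved: int
    total_tokens: int            # prompt + completion tokens across all test items
    usd_per_million_tokens: float

    @property
    def pass_rate(self) -> float:
        return self.tasks_solved / self.tasks_attempted

    @property
    def usd_per_solved_task(self) -> float:
        cost = self.total_tokens / 1_000_000 * self.usd_per_million_tokens
        return cost / self.tasks_solved

    @property
    def tokens_per_solved_task(self) -> float:
        return self.total_tokens / self.tasks_solved

runs = [
    RunStats("model-a-reasoning", 5000, 4100, 600_000_000, 2.50),
    RunStats("model-b-compact",   5000, 3600, 150_000_000, 0.40),
]

for r in sorted(runs, key=lambda r: r.usd_per_solved_task):
    print(f"{r.model}: pass={r.pass_rate:.1%}, "
          f"${r.usd_per_solved_task:.3f}/solved task, "
          f"{r.tokens_per_solved_task:,.0f} tokens/solved task")
```

Under a scheme like this, a reasoning-heavy model can lead on pass rate while trailing on tokens and dollars per solved task, which is the shape of tradeoff the abstract reports.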
Related papers
- ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents [0.6181816879349377]
We present ISD-Agent-Bench, a comprehensive benchmark comprising 25,795 scenarios generated via a Context Matrix framework. We compare existing ISD agents with novel agents grounded in classical ISD theories such as ADDIE, Dick & Carey, and Rapid Prototyping ISD. Experiments on 1,017 test scenarios demonstrate that integrating classical ISD frameworks with modern ReAct-style reasoning achieves the highest performance.
arXiv Detail & Related papers (2026-02-11T08:11:31Z)
- The Necessity of a Unified Framework for LLM-Based Agent Evaluation [46.631678638677386]
General-purpose agents have seen fundamental advancements. Evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation.
arXiv Detail & Related papers (2026-02-03T08:18:37Z)
- AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts [35.52607495764441]
Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities and can contribute substantially to economic production. We introduce AgencyBench, a benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve.
arXiv Detail & Related papers (2026-01-16T07:22:20Z)
- Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents [1.0305173936249623]
This white paper proposes a novel framework of eleven outcome-based, task-agnostic performance metrics for AI agents. We introduce metrics such as Goal Completion Rate (GCR), Autonomy Index (AIx), Multi-Step Task Resilience (MTR), and Business Impact Efficiency (BIE); a minimal sketch of plausible definitions for metrics of this kind appears after this related-papers list. Our results reveal significant performance trade-offs between different agent designs, highlighting the Hybrid Agent as the most consistently high-performing model.
arXiv Detail & Related papers (2025-11-11T13:40:46Z)
- APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training [48.20667772172573]
APTBench is a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text completion questions. It focuses on core agentic abilities such as planning and action, and covers key agent scenarios including software engineering and deep research. Compared to existing general-purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent.
arXiv Detail & Related papers (2025-10-28T13:11:22Z)
- How can we assess human-agent interactions? Case studies in software agent design [52.953425368394306]
We make two major steps towards the rigorous assessment of human-agent interactions. We propose PULSE, a framework for more efficient human-centric evaluation of agent designs. We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z)
- Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z)
- Sustainability via LLM Right-sizing [21.17523328451591]
Large language models (LLMs) have become increasingly embedded in organizational workflows. This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint.
arXiv Detail & Related papers (2025-04-17T04:00:40Z)
- The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance. We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
- More than Marketing? On the Information Value of AI Benchmarks for Practitioners [42.73526862595375]
In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. In product and policy, benchmarks were often found to be inadequate for informing substantive decisions. We conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals.
arXiv Detail & Related papers (2024-12-07T03:35:39Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
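The outcome-oriented metrics named in the "Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents" entry above (GCR, AIx, MTR, BIE) are not defined in these summaries. The Python sketch below shows one plausible way such run-level metrics could be computed from agent logs; the field names and formulas are assumptions made for illustration, not the cited paper's actual definitions.

```python
# Illustrative sketch only: plausible definitions for outcome-style agent metrics
# (goal completion rate, a simple autonomy index, multi-step resilience).
# These formulas are assumptions, not the cited paper's actual definitions.
from dataclasses import dataclass

@dataclass
class AgentRun:
    goals_total: int
    goals_completed: int
    steps_total: int
    human_interventions: int   # times a human had to step in
    failed_steps: int
    recovered_failures: int    # failed steps the agent later recovered from

def goal_completion_rate(runs):
    total = sum(r.goals_total for r in runs)
    return (sum(r.goals_completed for r in runs) / total) if total else 0.0

def autonomy_index(runs):
    # Fraction of steps taken without human intervention (assumed definition).
    steps = sum(r.steps_total for r in runs)
    return (1.0 - sum(r.human_interventions for r in runs) / steps) if steps else 0.0

def multi_step_resilience(runs):
    # Share of failed steps the agent recovered from (assumed definition).
    failures = sum(r.failed_steps for r in runs)
    return (sum(r.recovered_failures for r in runs) / failures) if failures else 1.0

runs = [AgentRun(10, 8, 120, 3, 6, 4), AgentRun(10, 9, 95, 1, 3, 2)]
print(f"GCR={goal_completion_rate(runs):.2f}, "
      f"AIx={autonomy_index(runs):.2f}, "
      f"MTR={multi_step_resilience(runs):.2f}")
```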