Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation
- URL: http://arxiv.org/abs/2602.11224v1
- Date: Wed, 11 Feb 2026 13:31:18 GMT
- Title: Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation
- Authors: Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson
- Abstract summary: Agent-Diff is a benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world tasks that execute code via external APIs. We provide benchmarks for nine LLMs across 224 tasks utilizing enterprise software. We also evaluate the robustness of the framework with ablation experiments to assess the contribution of API documentation access to benchmark performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world tasks that execute code via external APIs. Agentic LLM performance varies due to differences in models, external tool access, prompt structures, and agentic frameworks. Benchmarks must make fundamental trade-offs between a sandboxed approach that controls for variation in software environments and more ecologically valid approaches employing real services. Agent-Diff attempts to capture the desirable features of both approaches by including access to the real API interfaces for software services while sandboxing the environment in which calls are made, processed, and evaluated. This approach relies on two key innovations. The first is a novel state-diff contract, which separates process from outcome: rather than fuzzy trace or parameter matching, we define task success as whether the expected change in environment state was achieved. The second is a novel sandbox that provides a standardized scripting layer that all models use to execute code against external APIs (Slack, Box, Linear, Google Calendar). Thus, we can evaluate different agentic LLMs against a standardized set of contracts using a unified sandbox while still evaluating their performance on real-world service interfaces. Using the Agent-Diff framework, we provide benchmarks for nine LLMs across 224 tasks utilizing enterprise software workflows. In addition, we evaluate the robustness of the framework with ablation experiments to assess the contribution of API documentation access to benchmark performance. Code and data: https://github.com/agent-diff-bench/agent-diff.
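The state-diff contract can be pictured as a before/after snapshot comparison: capture the environment state, let the agent run, capture the state again, and check the observed change against an expected change. The sketch below is a minimal illustration of that idea, not the authors' implementation; `snapshot_state`, `run_agent_task`, and the `{path: (old, new)}` contract format are hypothetical names invented for this example.

```python
# Minimal sketch of a state-diff style evaluation. Illustrative only: the
# function names, state encoding, and contract schema are assumptions for
# this example, not the Agent-Diff codebase.

def snapshot_state(service) -> dict:
    """Capture relevant environment state as a flat {path: value} dict,
    e.g. {"slack/channels/general/topic": "standup at 9am"}."""
    return service.dump_state()  # hypothetical sandbox hook

def state_diff(before: dict, after: dict) -> dict:
    """Return only what changed, as {path: (old_value, new_value)};
    missing keys show up as None on the relevant side."""
    keys = before.keys() | after.keys()
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }

def evaluate_task(service, run_agent_task, expected_diff: dict) -> bool:
    """Task succeeds iff the observed state change matches the contract,
    regardless of which API calls the agent made to get there."""
    before = snapshot_state(service)
    run_agent_task()                 # agent executes arbitrary code/API calls
    after = snapshot_state(service)
    observed = state_diff(before, after)
    # Outcome-based check: every contracted change must appear, and no
    # extra, un-contracted changes are allowed.
    return observed == expected_diff
```

Because success is defined on the resulting state rather than on the call trace, two agents that reach the same end state through different API sequences score identically, which is what distinguishes this contract from fuzzy trace or parameter matching.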
Related papers
- ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?
We introduce ISO-Bench, a benchmark for coding agents to test their capabilities on real-world inference tasks. We curated 54 tasks from merged pull requests with measurable performance improvements.
arXiv Detail & Related papers (2026-02-23T08:37:53Z)
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z)
- Open Agent Specification (Agent Spec): A Unified Representation for AI Agents
We introduce Open Agent Specification (Agent Spec), a declarative language that defines AI agents and agentic workflows. Agent Spec defines a common set of components, control and data flow semantics, and schemas that allow an agent to be defined once and executed across different runtimes.
arXiv Detail & Related papers (2025-10-05T12:26:42Z)
- Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection
This work introduces Agent4FaceForgery to address two fundamental problems: how to capture the diverse intents and iterative processes of human forgery creation, and how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media.
arXiv Detail & Related papers (2025-09-16T01:05:01Z)
- MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
We introduce MCPEval, an open-source framework that automates end-to-end task generation and deep evaluation of intelligent agents. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance.
arXiv Detail & Related papers (2025-07-17T05:46:27Z)
- Evaluating LLMs on Sequential API Call Through Automated Test Generation
StateGen is an automated framework designed to generate diverse coding tasks involving sequential API interactions. We construct StateEval, a benchmark encompassing 120 verified test cases spanning three representative scenarios. Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks.
arXiv Detail & Related papers (2025-07-13T03:52:51Z)
- What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity. We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z)
- Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents
VLM-based mobile agents are increasingly popular due to their ability to interact with smartphone GUIs and XML-structured texts. Existing online benchmarks struggle to obtain stable reward signals due to dynamic environmental changes. Mobile-Bench-v2 includes a common task split, with offline multi-path evaluation to assess the agent's ability to obtain step rewards.
arXiv Detail & Related papers (2025-05-17T07:58:34Z)
- A Framework for Testing and Adapting REST APIs as LLM Tools
Large Language Models (LLMs) are increasingly used to build autonomous agents that perform complex tasks with external tools. Current benchmarks overlook these challenges, leaving a gap in assessing API readiness for agent-driven automation. We present a testing framework that systematically evaluates enterprise APIs when wrapped as Python tools for LLM-based agents.
arXiv Detail & Related papers (2025-04-22T02:52:08Z)
- Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs
Commercial Large Language Model (LLM) APIs create a fundamental trust problem: users pay for specific models but have no guarantee that providers deliver them faithfully. We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z)
- SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
- Reinforcement Learning for Long-Horizon Interactive LLM Agents
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests. We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments. We derive LOOP, a data- and memory-efficient variant of proximal policy optimization.
arXiv Detail & Related papers (2025-02-03T18:35:42Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Crab is the first benchmark framework designed to support cross-environment tasks. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
- ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents
We introduce ShortcutsBench, a benchmark for the comprehensive evaluation of API-based agents in solving real-world complex tasks. ShortcutsBench includes a wealth of real APIs from Apple Inc., refined user queries, human-annotated high-quality action sequences, detailed parameter filling values, and parameters requesting necessary input from the system or user.
arXiv Detail & Related papers (2024-06-28T08:45:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.