Related papers: SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents

SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents

URL: http://arxiv.org/abs/2506.08119v1
Date: Mon, 09 Jun 2025 18:20:12 GMT
Title: SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents
Authors: Subhrangshu Nandi, Arghya Datta, Nikhil Vichare, Indranil Bhattacharya, Huzefa Raja, Jing Xu, Shayan Ray, Giuseppe Carenini, Abhi Srivastava, Aaron Chan, Man Ho Woo, Amar Kandola, Brandon Theresa, Francesco Carbone,
Abstract summary: Large Language Models (LLMs) demonstrate impressive general-purpose reasoning and problem-solving abilities.<n>LLMs struggle with executing complex, long-horizon that demand strict adherence to Standard Operating Procedures.<n>We develop SOP-Bench, a benchmark of over 1,800 tasks across 10 industrial domains.
Score: 16.08820954102608
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Large Language Models (LLMs) demonstrate impressive general-purpose reasoning and problem-solving abilities. However, they struggle with executing complex, long-horizon workflows that demand strict adherence to Standard Operating Procedures (SOPs), a critical requirement for real-world industrial automation. Despite this need, there is a lack of public benchmarks that reflect the complexity, structure, and domain-specific nuances of SOPs. To address this, we present three main contributions. First, we introduce a synthetic data generation framework to create realistic, industry-grade SOPs that rigorously test the planning, reasoning, and tool-use capabilities of LLM-based agents. Second, using this framework, we develop SOP-Bench, a benchmark of over 1,800 tasks across 10 industrial domains, each with APIs, tool interfaces, and human-validated test cases. Third, we evaluate two prominent agent architectures: Function-Calling and ReAct Agents, on SOP-Bench, observing average success rates of only 27% and 48%, respectively. Remarkably, when the tool registry is much larger than necessary, agents invoke incorrect tools nearly 100% of the time. These findings underscore a substantial gap between current agentic capabilities of LLMs and the demands of automating real-world SOPs. Performance varies significantly by task and domain, highlighting the need for domain-specific benchmarking and architectural choices before deployment. SOP-Bench is publicly available at http://sop-bench.s3-website-us-west-2.amazonaws.com/. We also release the prompts underpinning the data generation framework to support new domain-specific SOP benchmarks. We invite the community to extend SOP-Bench with SOPs from their industrial domains.

Related papers

Evaluating LLMs on Sequential API Call Through Automated Test Generation [10.621357661774244]
StateGen is an automated framework designed to generate diverse coding tasks involving sequential API interactions.<n>We construct StateEval, a benchmark encompassing 120 verified test cases spanning across three representative scenarios.<n> Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks.
arXiv Detail & Related papers (2025-07-13T03:52:51Z)
SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments [2.184775414778289]
We introduce setupbench, a benchmark that isolates the environment-bootstrap skill.<n>Our tasks span seven language ecosystems, five database engines, and multi-service orchestration scenarios.<n>We find low success rates across task categories, with particular challenges in repository setup (38.9-57.4%) and local database configuration (20.0-53.3%)
arXiv Detail & Related papers (2025-07-11T22:45:07Z)
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [54.52092001110694]
ThinkGeo is a benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning.<n>Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications.<n>Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z)
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications.<n>AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z)
DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain [6.275468311396066]
Large Language Models (LLMs) have achieved impressive performance in diverse natural language processing tasks.<n>We introduce the DMind Benchmark, a holistic Web3-oriented evaluation suite covering nine critical subfields.<n>We evaluate 26 models, including ChatGPT, Claude, DeepSeek, Gemini, Grok, and Qwen.
arXiv Detail & Related papers (2025-04-18T16:40:39Z)
OmniNova:A General Multimodal Agent Framework [0.5439020425819]
Large Language Models (LLMs) with specialized tools presents new opportunities for intelligent automation systems.<n>We present OmniNova, a modular multi-agent automation framework that combines language models with specialized tools such as web search, crawling, and code execution capabilities.
arXiv Detail & Related papers (2025-03-25T19:21:01Z)
SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions.<n>Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions.<n>We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs [9.117180930298813]
General-purpose AI agents struggle to efficiently utilize domain-specific knowledge and human expertise.<n>We introduce the Standard Operational Procedure-guided Agent ( SOP-agent), a novel framework for constructing domain-specific agents.<n> SOP-agent demonstrates excellent versatility, achieving performance superior to general-purpose agent frameworks.
arXiv Detail & Related papers (2025-01-16T06:14:58Z)
Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark [38.14474956762422]
We introduce DomainCodeBench, a benchmark designed to evaluate large language models (LLMs) across 12 software application domains and 15 programming languages.<n>We find that top general-domain models do not consistently excel in specific application domains.<n>We show that augmenting prompts with domain-specific knowledge improves performance by around 38.17%.
arXiv Detail & Related papers (2024-12-24T17:56:08Z)
AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs. Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents [7.166156709980112]
We introduce textscShortcutsBench, a benchmark for the comprehensive evaluation of API-based agents in solving real-world complex tasks.<n>textscShortcutsBench includes a wealth of real APIs from Apple Inc., refined user queries, human-annotated high-quality action sequences, detailed parameter filling values, and parameters requesting necessary input from the system or user.
arXiv Detail & Related papers (2024-06-28T08:45:02Z)
Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub [79.31134731122462]
We introduce OpenAct benchmark to evaluate the open-domain task-solving capability, built on human expert consultation and repositories in GitHub.<n>We present OpenAgent, a novel LLM-based agent system that can tackle evolving queries in open domains through autonomously integrating specialized tools from GitHub.
arXiv Detail & Related papers (2023-12-28T15:47:30Z)
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction [67.54420015049732]
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments. Existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains. We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2023-05-23T18:01:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.