Related papers: SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures

SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures

URL: http://arxiv.org/abs/2510.08942v1
Date: Fri, 10 Oct 2025 02:47:53 GMT
Title: SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures
Authors: Jiaming Wang, Zhe Tang, Yilin Jin, Peng Ding, Xiaoyu Li, Xuezhi Cao,
Abstract summary: Large language models (LLMs) are widely deployed as domain-specific agents.<n>We propose SOP-Maze, a benchmark constructed from real-world business data.<n>Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze.
Score: 10.868853536476317
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 tasks from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and Heart Root System (HRS), which emphasizes deep logical reasoning with complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. The systematic study explores LLM performance across SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work on https://github.com/ADoublLEN/SOP-Maze.

Related papers

MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents [7.339769470891067]
MSCoRe is a novel benchmark comprising 126696 domain-specific QA instances spanning scenarios in automotive, pharmaceutical, electronics, and energy sectors.<n>The commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks.<n>MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents.
arXiv Detail & Related papers (2025-09-22T11:36:16Z)
How Good are Foundation Models in Step-by-Step Embodied Reasoning? [79.15268080287505]
Embodied agents must make decisions that are safe, spatially coherent, and grounded in context.<n>Recent advances in large multimodal models have shown promising capabilities in visual understanding and language generation.<n>Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments.
arXiv Detail & Related papers (2025-09-18T17:56:30Z)
Large Language Models and Operations Research: A Structured Survey [9.208082097215314]
Large language models (LLMs) have shown potential to address limitations through semantic understanding, structured generation, and reasoning control.<n>LLMs can translate natural language descriptions into mathematical models or executable code, generate benchmarks, evolve algorithms, and tackle optimization tasks.
arXiv Detail & Related papers (2025-09-18T01:52:19Z)
Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization [8.356074728041202]
TAM Bench is a benchmark for evaluating large language models (LLMs) on end-to-end machine learning tasks.<n>Three key innovations include a browser automation and LLM-based task acquisition system.<n>Based on 150 curated AutoML tasks, we construct three benchmark subsets of different sizes.
arXiv Detail & Related papers (2025-09-11T10:10:48Z)
EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [64.70546873396624]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs)<n>EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently.<n>We also propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflow.
arXiv Detail & Related papers (2025-06-10T02:39:55Z)
Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling [1.219841051166348]
In this paper, we explore the combined potential of in-context search and test-time scaling on super hard reasoning tasks.<n>We find that by employing advanced in-context search prompting to LLMs augmented with internal scaling, one can achieve transformative performance breakthroughs.
arXiv Detail & Related papers (2025-05-28T12:28:18Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions. We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text. Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented Dialog Systems [56.302581679816775]
This paper proposes Comprehensive Instruction (CINS) that exploits PLMs with task-specific instructions. We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD. Experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data.
arXiv Detail & Related papers (2021-09-10T03:23:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.