Agentic Troubleshooting Guide Automation for Incident Management
- URL: http://arxiv.org/abs/2510.10074v1
- Date: Sat, 11 Oct 2025 07:18:36 GMT
- Title: Agentic Troubleshooting Guide Automation for Incident Management
- Authors: Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, Shilin He, Chaoyun Zhang, Si Qin, Samia Khalid, Qingwei Lin, Saravan Rajmohan, Sitaram Lanka, Dongmei Zhang,
- Abstract summary: We present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation.<n>We show that StepFly achieves a 94% success rate on GPT-4.1, outperforming baselines with less time and token consumption.<n>It achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs.
- Score: 46.78600624203546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist SREs in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution DAGs from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to guarantee correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs.
Related papers
- Lemon Agent Technical Report [12.663220335253529]
Lemon Agent is a multi-agent orchestrator-worker system built on a newly proposed AgentCortex framework.<n>Our system integrates a hierarchical self-adaptive scheduling mechanism that operates at both the overall orchestrator layer and workers layer.<n>By virtue of this two-tier architecture, the system achieves synergistic balance between global task coordination and local task execution.
arXiv Detail & Related papers (2026-02-06T10:09:49Z) - BEAP-Agent: Backtrackable Execution and Adaptive Planning for GUI Agents [10.011001146444325]
Existing GUI agents struggle to recover once they follow an incorrect exploration path, often leading to task failure.<n>We propose BEAP-Agent, a framework that supports long-range, multi-level state backtracking with dynamic task tracking and updating.
arXiv Detail & Related papers (2026-01-29T07:22:50Z) - Experience-Guided Adaptation of Inference-Time Reasoning Strategies [49.954515048847874]
Experience-Guided Reasoner (EGuR) generates tailored strategies at inference time based on accumulated experience.<n>EGuR achieves up to 14% accuracy improvements over the strongest baselines while reducing computational costs by up to 111x.
arXiv Detail & Related papers (2025-11-14T17:45:28Z) - Towards Engineering Multi-Agent LLMs: A Protocol-Driven Approach [13.760107452858044]
This paper introduces Software Engineering Multi-Agent Protocol (SEMAP), a protocol-layer methodology that instantiates three core SE design principles for multi-agents.<n>In code development, it achieves up to a 69.6% reduction in total failures function-level development and 56.7% for deployment-level development.
arXiv Detail & Related papers (2025-10-14T03:49:30Z) - Autonomous Control Leveraging LLMs: An Agentic Framework for Next-Generation Industrial Automation [0.0]
We introduce a unified agentic framework that leverages large language models (LLMs) for both discrete fault-recovery planning and continuous process control.<n>Our results demonstrate that, with structured feedback and modular agents, LLMs can unify high-level symbolic planningand low-level continuous control.
arXiv Detail & Related papers (2025-07-03T11:20:22Z) - Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive.<n>We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons.<n>The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z) - SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning [2.1331883629523634]
SagaLLM is a structured multi-agent architecture designed to address four foundational limitations of current LLM-based planning systems.<n>It bridges this gap by integrating the Saga transactional pattern with persistent memory, automated compensation, and independent validation agents.<n>It achieves significant improvements in consistency, validation accuracy, and adaptive coordination under uncertainty.
arXiv Detail & Related papers (2025-03-15T01:43:03Z) - SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions.<n>Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions.<n>We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z) - Improving Multi-turn Task Completion in Task-Oriented Dialog Systems via Prompt Chaining and Fine-Grained Feedback [2.246166820363412]
Task-oriented dialog (TOD) systems facilitate users in accomplishing complex, multi-turn tasks through natural language.<n>LLMs struggle to reliably handle multi-turn task completion.<n>We propose RealTOD, a novel framework that enhances TOD systems through prompt chaining and fine-grained feedback mechanisms.
arXiv Detail & Related papers (2025-02-18T21:36:19Z) - DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language models (LLMs) agent and case-based reasoning (CBR)
In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle.
In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z) - ADaPT: As-Needed Decomposition and Planning with Language Models [131.063805299796]
We introduce As-Needed Decomposition and Planning for complex Tasks (ADaPT)
ADaPT explicitly plans and decomposes complex sub-tasks as-needed, when the Large Language Models is unable to execute them.
Our results demonstrate that ADaPT substantially outperforms established strong baselines.
arXiv Detail & Related papers (2023-11-08T17:59:15Z) - AutoTSG: Learning and Synthesis for Incident Troubleshooting [6.297939852772734]
We conduct a large-scale empirical study of over 4K+ TSGs mapped to 1000s of incidents.
We find that TSGs are widely used and help significantly reduce mitigation efforts.
We propose AutoTSG -- a novel framework for automation of TSGs executable by combining machine learning and program synthesis.
arXiv Detail & Related papers (2022-05-26T16:05:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.