Related papers: Agentic Troubleshooting Guide Automation for Incident Management

URL: http://arxiv.org/abs/2510.10074v1
Date: Sat, 11 Oct 2025 07:18:36 GMT
Title: Agentic Troubleshooting Guide Automation for Incident Management
Authors: Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, Shilin He, Chaoyun Zhang, Si Qin, Samia Khalid, Qingwei Lin, Saravan Rajmohan, Sitaram Lanka, Dongmei Zhang
Abstract summary: We present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. We show that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. It achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs.
Score: 46.78600624203546
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist SREs in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution DAGs from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to guarantee correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs.
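The third stage described in the abstract, a DAG-guided scheduler that runs independent steps in parallel and shares results through memory, can be illustrated with a minimal sketch. This is not StepFly's implementation: the function `run_dag`, the step names, and the thread-pool design are all assumptions made for illustration, and a plain dictionary stands in for the memory system.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_dag(steps, deps, max_workers=4):
    """Run each step as soon as all of its dependencies have finished.

    steps: dict mapping a step name to a callable that takes the results so far
    deps:  dict mapping a step name to the set of step names it depends on
    Returns a dict mapping each step name to its result.
    (Hypothetical helper, not part of any published framework.)
    """
    remaining = {name: set(deps.get(name, ())) for name in steps}
    results = {}  # stands in for the shared memory of completed steps
    running = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Every step with no unmet dependency is independent: submit it now.
            ready = [name for name, unmet in remaining.items() if not unmet]
            for name in ready:
                del remaining[name]
                running[pool.submit(steps[name], dict(results))] = name
            if not running:
                raise ValueError("cycle or unknown dependency in DAG")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                for unmet in remaining.values():
                    unmet.discard(name)
    return results


# Hypothetical three-step guide: two independent checks, then a summary.
steps = {
    "check_alerts": lambda memory: "alerts ok",
    "query_logs": lambda memory: "logs ok",
    "summarize": lambda memory: f"{memory['check_alerts']}; {memory['query_logs']}",
}
deps = {"summarize": {"check_alerts", "query_logs"}}

out = run_dag(steps, deps)
print(out["summarize"])
```

In this sketch the two checks may run concurrently, while `summarize` starts only after both have finished, which is the property that allows parallelizable guides to finish sooner than a strictly sequential run.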

Related papers

Lemon Agent Technical Report [12.663220335253529]
Lemon Agent is a multi-agent orchestrator-worker system built on a newly proposed AgentCortex framework. Our system integrates a hierarchical self-adaptive scheduling mechanism that operates at both the orchestrator layer and the worker layer. By virtue of this two-tier architecture, the system achieves a synergistic balance between global task coordination and local task execution.
arXiv Detail & Related papers (2026-02-06T10:09:49Z)
BEAP-Agent: Backtrackable Execution and Adaptive Planning for GUI Agents [10.011001146444325]
Existing GUI agents struggle to recover once they follow an incorrect exploration path, often leading to task failure. We propose BEAP-Agent, a framework that supports long-range, multi-level state backtracking with dynamic task tracking and updating.
arXiv Detail & Related papers (2026-01-29T07:22:50Z)
Experience-Guided Adaptation of Inference-Time Reasoning Strategies [49.954515048847874]
Experience-Guided Reasoner (EGuR) generates tailored strategies at inference time based on accumulated experience. EGuR achieves up to 14% accuracy improvements over the strongest baselines while reducing computational costs by up to 111x.
arXiv Detail & Related papers (2025-11-14T17:45:28Z)
Towards Engineering Multi-Agent LLMs: A Protocol-Driven Approach [13.760107452858044]
This paper introduces Software Engineering Multi-Agent Protocol (SEMAP), a protocol-layer methodology that instantiates three core SE design principles for multi-agent systems. In code development, it achieves up to a 69.6% reduction in total failures for function-level development and 56.7% for deployment-level development.
arXiv Detail & Related papers (2025-10-14T03:49:30Z)
Autonomous Control Leveraging LLMs: An Agentic Framework for Next-Generation Industrial Automation [0.0]
We introduce a unified agentic framework that leverages large language models (LLMs) for both discrete fault-recovery planning and continuous process control. Our results demonstrate that, with structured feedback and modular agents, LLMs can unify high-level symbolic planning and low-level continuous control.
arXiv Detail & Related papers (2025-07-03T11:20:22Z)
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z)
SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning [2.1331883629523634]
SagaLLM is a structured multi-agent architecture designed to address four foundational limitations of current LLM-based planning systems. It integrates the Saga transactional pattern with persistent memory, automated compensation, and independent validation agents. It achieves significant improvements in consistency, validation accuracy, and adaptive coordination under uncertainty.
arXiv Detail & Related papers (2025-03-15T01:43:03Z)
SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
Improving Multi-turn Task Completion in Task-Oriented Dialog Systems via Prompt Chaining and Fine-Grained Feedback [2.246166820363412]
Task-oriented dialog (TOD) systems help users accomplish complex, multi-turn tasks through natural language. LLMs struggle to reliably handle multi-turn task completion. We propose RealTOD, a novel framework that enhances TOD systems through prompt chaining and fine-grained feedback mechanisms.
arXiv Detail & Related papers (2025-02-18T21:36:19Z)
DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language model (LLM) agents and case-based reasoning (CBR). In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle. In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z)
ADaPT: As-Needed Decomposition and Planning with Language Models [131.063805299796]
We introduce As-Needed Decomposition and Planning for complex Tasks (ADaPT). ADaPT explicitly plans and decomposes complex sub-tasks as needed, when the large language model is unable to execute them. Our results demonstrate that ADaPT substantially outperforms established strong baselines.
arXiv Detail & Related papers (2023-11-08T17:59:15Z)
AutoTSG: Learning and Synthesis for Incident Troubleshooting [6.297939852772734]
We conduct a large-scale empirical study of over 4K TSGs mapped to 1000s of incidents. We find that TSGs are widely used and help significantly reduce mitigation efforts. We propose AutoTSG, a novel framework for automation of TSGs to executable workflows by combining machine learning and program synthesis.
arXiv Detail & Related papers (2022-05-26T16:05:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.