When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems
- URL: http://arxiv.org/abs/2601.16280v1
- Date: Thu, 22 Jan 2026 19:24:21 GMT
- Title: When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems
- Authors: Donghao Huang, Gauri Malwe, Zhaoxia Wang
- Abstract summary: Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation.
We introduce a comprehensive diagnostic framework that leverages big data analytics to evaluate procedural reliability in intelligent agent systems.
This work establishes foundational infrastructure for systematic reliability evaluation of tool-augmented AI systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation, yet systematic evaluation methodologies for assessing tool-use reliability remain underdeveloped. We introduce a comprehensive diagnostic framework that leverages big data analytics to evaluate procedural reliability in intelligent agent systems, addressing critical needs for SME-centric deployment in privacy-sensitive environments. Our approach features a 12-category error taxonomy capturing failure modes across tool initialization, parameter handling, execution, and result interpretation. Through systematic evaluation of 1,980 deterministic test instances spanning both open-weight models (Qwen2.5 series, Functionary) and proprietary alternatives (GPT-4, Claude 3.5/3.7) across diverse edge hardware configurations, we identify actionable reliability thresholds for production deployment. Our analysis reveals that procedural reliability, particularly tool initialization failures, constitutes the primary bottleneck for smaller models, while qwen2.5:32b achieves flawless performance matching GPT-4.1. The framework demonstrates that mid-sized models (qwen2.5:14b) offer practical accuracy-efficiency trade-offs on commodity hardware (96.6% success rate, 7.3 s latency), enabling cost-effective intelligent agent deployment for resource-constrained organizations. This work establishes foundational infrastructure for systematic reliability evaluation of tool-augmented multi-agent AI systems.
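The abstract names the four phases of the error taxonomy (tool initialization, parameter handling, execution, result interpretation) but not the 12 category labels. A minimal sketch of how such a taxonomy could be encoded and applied to trace records is below; the individual category names and the `classify` helper are illustrative assumptions, not the paper's actual labels or API.

```python
from enum import Enum
from typing import Optional

class ToolErrorCategory(Enum):
    """Hypothetical 12-category taxonomy grouped by the four phases
    named in the abstract; category names are illustrative only."""
    # Phase: tool initialization
    TOOL_NOT_INVOKED = "tool_not_invoked"
    WRONG_TOOL_SELECTED = "wrong_tool_selected"
    MALFORMED_CALL_SYNTAX = "malformed_call_syntax"
    # Phase: parameter handling
    MISSING_REQUIRED_PARAM = "missing_required_param"
    WRONG_PARAM_TYPE = "wrong_param_type"
    HALLUCINATED_PARAM_VALUE = "hallucinated_param_value"
    # Phase: execution
    EXECUTION_TIMEOUT = "execution_timeout"
    RUNTIME_EXCEPTION = "runtime_exception"
    UNHANDLED_TOOL_ERROR = "unhandled_tool_error"
    # Phase: result interpretation
    RESULT_IGNORED = "result_ignored"
    RESULT_MISREAD = "result_misread"
    PREMATURE_TERMINATION = "premature_termination"

def classify(trace: dict) -> Optional[ToolErrorCategory]:
    """Map one test-instance trace to an error category; None means success."""
    if trace.get("success"):
        return None
    return ToolErrorCategory(trace["error_code"])
```

Keeping the taxonomy as a closed enumeration makes aggregate failure-mode statistics over the 1,980 test instances a simple group-by over `classify` outputs.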
Related papers
- AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents [75.67445299298949]
AgentCPM-Explore is a compact 4B agent model with high knowledge density and strong exploration capability.
We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement.
AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet and DeepSeek-v3.2 on five benchmarks.
arXiv Detail & Related papers (2026-02-06T08:24:59Z)
- Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory Calibration (HTC) is a novel diagnostic framework for AI agents.
HTC consistently surpasses strong baselines in both calibration and discrimination.
HTC provides interpretability by revealing the signals behind failures.
arXiv Detail & Related papers (2026-01-22T09:08:25Z)
- AI-NativeBench: An Open-Source White-Box Agentic Benchmark Suite for AI-Native Systems [52.65695508605237]
We introduce AI-NativeBench, the first application-centric and white-box AI-Native benchmark suite grounded in Model Context Protocol (MCP) and Agent-to-Agent (A2A) standards.
By treating agentic spans as first-class citizens within distributed traces, our methodology enables granular analysis of engineering characteristics beyond simple capabilities.
This work provides the first systematic evidence to guide the transition from measuring model capability to engineering reliable AI-Native systems.
arXiv Detail & Related papers (2026-01-14T11:32:07Z)
- Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties.
We derive a predictive model using coordination metrics (cross-validated R2=0), enabling prediction on unseen task domains.
We identify three effects, including (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
- How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations [0.0]
We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities.
Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models.
We identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution.
arXiv Detail & Related papers (2025-12-08T12:27:15Z)
- Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains [0.0]
We present the first systematic study of cross-LLM behavioral backdoor detection.
We show that single-model detectors achieve 92.7% accuracy within their training distribution but only 49.2% across different LLMs.
We show that model-aware detection, which incorporates model identity as an additional feature, achieves 90.6% accuracy universally across all evaluated models.
arXiv Detail & Related papers (2025-11-25T03:33:04Z)
- Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures.
We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with an Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy.
Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7x compared to strong prompting and uncertainty-based baselines.
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
- Formal Analysis of Metastable Failures in Software Systems [5.436969030534807]
We provide the mathematical foundations of metastability in request-response server systems.
We show how to construct continuous-time Markov chains (CTMCs) that approximate the semantics of the programs.
We show that our qualitative visual analysis captures and predicts many instances of metastability that were observed in the field in a matter of milliseconds.
arXiv Detail & Related papers (2025-10-03T22:44:07Z)
- MCP-Orchestrated Multi-Agent System for Automated Disinformation Detection [84.75972919995398]
This paper presents a multi-agent system that uses relation extraction to detect disinformation in news articles.
The proposed Agentic AI system combines four agents: (i) a machine learning agent (logistic regression), (ii) a Wikipedia knowledge check agent, and (iv) a web-scraped data analyzer.
Results demonstrate that the multi-agent ensemble achieves 95.3% accuracy with an F1 score of 0.964, significantly outperforming individual agents and traditional approaches.
arXiv Detail & Related papers (2025-08-13T19:14:48Z)
- Routine: A Structural Planning Framework for LLM Agent System in Enterprise [10.989149053905587]
The deployment of agent systems in an enterprise environment is often hindered by several challenges.
Common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability.
This paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing.
arXiv Detail & Related papers (2025-07-19T02:46:19Z)
- A Holistic Assessment of the Reliability of Machine Learning Systems [30.638615396429536]
This paper proposes a holistic assessment methodology for the reliability of machine learning (ML) systems.
Our framework evaluates five key properties: in-distribution accuracy, distribution-shift robustness, adversarial robustness, calibration, and out-of-distribution detection.
To provide insights into the performance of different algorithmic approaches, we identify and categorize state-of-the-art techniques.
arXiv Detail & Related papers (2023-07-20T05:00:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.