Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance
- URL: http://arxiv.org/abs/2511.10400v1
- Date: Fri, 14 Nov 2025 01:49:12 GMT
- Title: Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance
- Authors: Lifan Zheng, Jiawei Chen, Qinghong Yin, Jingyuan Zhang, Xinyi Zeng, Yu Tian
- Abstract summary: Advances in large language models (LLMs) have established LLM-based agents as a major branch of multi-agent systems (MAS). In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism, to enhance the stability of MAS with different topologies.
- Score: 16.514747521376915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored, i.e., whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism, to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7% fault rate). Notably, our approach surpasses traditional methods, attaining remarkable accuracy across topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.
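The abstract describes CP-WBFT only at a high level, so the sketch below illustrates just the core idea it builds on: weighting a Byzantine-tolerant vote by a per-agent confidence signal instead of counting heads. The probe values, the weighting rule, and the 2/3 quorum here are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict

def confidence_weighted_vote(answers: dict[str, str],
                             confidences: dict[str, float],
                             quorum: float = 2 / 3) -> str | None:
    """Aggregate agent answers by confidence weight rather than raw count.

    `confidences` stands in for the paper's confidence probe (hypothetical
    here): a reliability score in [0, 1] per agent. Returns the winning
    answer if its weight share clears the quorum, else None (no consensus).
    """
    weight = defaultdict(float)
    for agent, answer in answers.items():
        weight[answer] += confidences[agent]
    total = sum(confidences.values())
    best = max(weight, key=weight.get)
    return best if weight[best] / total >= quorum else None

# Toy run: 5 of 7 agents are faulty (~71% fault rate) but probe as
# low-confidence, so the two honest agents still carry the vote.
answers = {f"a{i}": "7" for i in range(5)} | {"a5": "42", "a6": "42"}
confs = {f"a{i}": 0.1 for i in range(5)} | {"a5": 0.9, "a6": 0.85}
print(confidence_weighted_vote(answers, confs))  # -> "42"
```

Under an unweighted majority the five faulty agents would win; the confidence weighting is what lets the honest minority prevail, which mirrors the behavior the abstract reports under extreme fault rates.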
Related papers
- Feature-Space Adversarial Robustness Certification for Multimodal Large Language Models [59.6491828112519]
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications. However, MLLMs are vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. We propose Feature-space Smoothing (FS), a general framework that provides certified robustness guarantees at the feature representation level of MLLMs.
arXiv Detail & Related papers (2026-01-22T18:52:21Z) - Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory Calibration (HTC) is a novel diagnostic framework for AI agents. HTC consistently surpasses strong baselines in both calibration and discrimination. HTC provides interpretability by revealing the signals behind failure.
arXiv Detail & Related papers (2026-01-22T09:08:25Z) - ResMAS: Resilience Optimization in LLM-based Multi-agent Systems [37.355345383912756]
Large Language Model-based Multi-Agent Systems (LLM-based MAS) are typically distributed across different devices or environments, making them vulnerable to perturbations such as agent failures. We study the resilience of MAS under perturbations and find that both the communication topology and prompt design significantly influence system resilience.
arXiv Detail & Related papers (2026-01-08T08:03:37Z) - Hypothesize-Then-Verify: Speculative Root Cause Analysis for Microservices with Pathwise Parallelism [19.31110304702373]
SpecRCA is a speculative root cause analysis framework that adopts a hypothesize-then-verify paradigm. Preliminary experiments on the AIOps 2022 dataset demonstrate that SpecRCA achieves superior accuracy and efficiency compared to existing approaches.
arXiv Detail & Related papers (2026-01-06T05:58:25Z) - Testing and Enhancing Multi-Agent Systems for Robust Code Generation [21.38351747327572]
Multi-agent systems (MASs) have emerged as a promising paradigm for automated code generation. Despite their rapid development and adoption, their robustness remains under-explored. This paper presents the first comprehensive study examining the robustness of MASs for code generation through a fuzzing-based testing approach.
arXiv Detail & Related papers (2025-10-12T05:45:04Z) - FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning [62.452350134196934]
FaithCoT-Bench is a unified benchmark for instance-level CoT unfaithfulness detection. Our framework formulates unfaithfulness detection as a discriminative decision problem. FaithCoT-Bench sets a solid basis for future research toward more interpretable and trustworthy reasoning in LLMs.
arXiv Detail & Related papers (2025-10-05T05:16:54Z) - LLM-based Agents for Automated Confounder Discovery and Subgroup Analysis in Causal Inference [1.1538255621565348]
We propose Large Language Model-based agents for automated confounder discovery and subgroup analysis. Our framework systematically performs subgroup identification and confounding structure discovery. Our findings suggest that LLM-based agents offer a promising path toward scalable, trustworthy, and semantically aware causal inference.
arXiv Detail & Related papers (2025-08-10T07:45:49Z) - Enhancing Robustness of LLM-Driven Multi-Agent Systems through Randomized Smoothing [13.997409139696556]
This paper presents a framework for enhancing the safety of large language model (LLM) empowered multi-agent systems (MAS) in safety-critical domains such as aerospace. We apply randomized smoothing, a statistical robustness certification technique, to the MAS consensus context, enabling probabilistic guarantees on agent decisions under adversarial influence (a minimal sketch of this recipe appears after this list).
arXiv Detail & Related papers (2025-07-05T17:26:08Z) - Attention Knows Whom to Trust: Attention-based Trust Management for LLM Multi-Agent Systems [52.57826440085856]
Large Language Model-based Multi-Agent Systems (LLM-MAS) have demonstrated strong capabilities in solving complex tasks but remain vulnerable when agents receive unreliable messages. This vulnerability stems from a fundamental gap: LLM agents treat all incoming messages equally without evaluating their trustworthiness. We propose Attention Trust Score (A-Trust), a lightweight, attention-based method for evaluating message trustworthiness.
arXiv Detail & Related papers (2025-06-03T07:32:57Z) - A Weighted Byzantine Fault Tolerance Consensus Driven Trusted Multiple Large Language Models Network [53.37983409425452]
Large Language Models (LLMs) have achieved remarkable success across a wide range of applications. Recently, collaborative frameworks such as the Multi-LLM Network (MultiLLMN) have been introduced. We propose a novel Trusted MultiLLMN framework driven by a weighted Byzantine Fault Tolerance (WBFT) blockchain consensus mechanism.
arXiv Detail & Related papers (2025-05-08T10:04:41Z) - A Trustworthy Multi-LLM Network: Challenges,Solutions, and A Use Case [59.58213261128626]
We propose a blockchain-enabled collaborative framework that connects multiple Large Language Models (LLMs) into a Trustworthy Multi-LLM Network (MultiLLMN). This architecture enables the cooperative evaluation and selection of the most reliable and high-quality responses to complex network optimization problems.
arXiv Detail & Related papers (2025-05-06T05:32:46Z) - Statistical Runtime Verification for LLMs via Robustness Estimation [0.0]
Adversarial robustness verification is essential for ensuring the safe deployment of Large Language Models (LLMs) in runtime-critical applications. This paper presents a case study adapting and extending the RoMA statistical verification framework to assess its feasibility as an online runtime robustness monitor for LLMs in black-box deployment settings.
arXiv Detail & Related papers (2025-04-24T16:36:19Z) - Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [41.19330514054401]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z) - Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge [0.3759936323189418]
Large Language Models (LLMs) have become increasingly powerful and ubiquitous, but the variability of their outputs poses challenges to reliability. We introduce a novel framework for rigorously evaluating the reliability of LLM judgments, leveraging McDonald's omega (a short omega computation appears after this list).
arXiv Detail & Related papers (2024-12-17T03:37:31Z) - How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors.
Based on SimulateBench, we evaluate the performance of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z)
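The randomized smoothing entry above applies a standard certification recipe to MAS consensus: sample perturbed copies of an input, take a majority vote over the resulting decisions, and lower-bound the top decision's probability with a confidence interval. Since that paper's method is not detailed here, the following is a minimal sketch of the generic recipe under an assumed word-dropout noise model; the function names and parameters are hypothetical.

```python
import random
from collections import Counter
from statistics import NormalDist

def smoothed_decision(decide, message: str, n: int = 1000,
                      alpha: float = 0.05) -> tuple[str, float]:
    """Monte-Carlo estimate of the smoothed decision on a text message.

    `decide` maps a (perturbed) message to a discrete decision. Each
    sample independently drops words with probability 0.1, standing in
    for whatever noise model the certified bound is derived under.
    """
    votes = Counter()
    for _ in range(n):
        kept = [w for w in message.split() if random.random() > 0.1]
        votes[decide(" ".join(kept))] += 1
    top, count = votes.most_common(1)[0]
    # One-sided lower confidence bound on the top decision's probability
    # (normal approximation; Clopper-Pearson would be tighter).
    p_hat = count / n
    z = NormalDist().inv_cdf(1 - alpha)
    p_low = p_hat - z * (p_hat * (1 - p_hat) / n) ** 0.5
    return top, max(p_low, 0.0)

# Toy decision rule: flag a message as unsafe if it mentions "attack".
decision, p_low = smoothed_decision(
    lambda m: "unsafe" if "attack" in m else "safe",
    "please launch the attack now")
print(decision, round(p_low, 3))  # e.g. ('unsafe', 0.88)
```

If p_low exceeds 1/2, the majority decision is stable under the sampled noise with confidence 1 - alpha, which is the kind of probabilistic guarantee the entry refers to.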
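The LLM-as-a-Judge entry measures reliability with McDonald's omega, a standard psychometric coefficient. For a unidimensional factor model with loadings lambda_i and error variances psi_i, omega = (sum lambda)^2 / ((sum lambda)^2 + sum psi); the sketch below computes the coefficient directly, with invented example numbers (how that paper estimates the loadings from LLM judgments is not shown here).

```python
def mcdonalds_omega(loadings: list[float], error_vars: list[float]) -> float:
    """McDonald's omega for a unidimensional factor model:
    (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances).
    Values near 1 mean the items (here, repeated judgments) reliably
    measure a single construct."""
    s = sum(loadings) ** 2
    return s / (s + sum(error_vars))

# Invented example: four repeated LLM judgments treated as items, with
# standardized loadings and error variances psi_i = 1 - lambda_i^2.
print(round(mcdonalds_omega([0.8, 0.7, 0.75, 0.6],
                            [0.36, 0.51, 0.44, 0.64]), 3))  # -> 0.806
```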
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.