When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements
- URL: http://arxiv.org/abs/2502.15153v2
- Date: Thu, 02 Oct 2025 15:55:21 GMT
- Title: When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements
- Authors: Tianjie Ju, Bowen Wang, Hao Fei, Mong-Li Lee, Wynne Hsu, Yun Li, Qianren Wang, Pengzhou Cheng, Zongru Wu, Haodong Zhao, Zhuosheng Zhang, Gongshen Liu
- Abstract summary: We argue that disagreements prevent premature consensus and expand the explored solution space. Disagreements on task-critical steps can derail collaboration depending on the topology of solution paths.
- Score: 56.29265568399648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Large Language Models (LLMs) have upgraded them from sophisticated text generators to autonomous agents capable of cooperation and tool use in multi-agent systems (MAS). However, it remains unclear how disagreements shape collective decision-making. In this paper, we revisit the role of disagreement and argue that general, partially overlapping disagreements prevent premature consensus and expand the explored solution space, while disagreements on task-critical steps can derail collaboration depending on the topology of solution paths. We investigate two collaborative settings with distinct path structures: collaborative reasoning (CounterFact, MQuAKE-cf), which typically follows a single evidential chain, and collaborative programming (HumanEval, GAIA), which often admits multiple valid implementations. Disagreements are instantiated as general heterogeneity among agents and as task-critical counterfactual knowledge edits injected into context or parameters. Experiments reveal that general disagreements consistently improve success by encouraging complementary exploration. By contrast, task-critical disagreements substantially reduce success on single-path reasoning, yet have a limited impact on programming, where agents can choose alternative solutions. Trace analyses show that the MAS frequently bypasses the edited facts in programming but rarely does so in reasoning, revealing an emergent self-repair capability that depends on the topology of solution paths rather than model scale alone. Our code is available at https://github.com/wbw625/MultiAgentRobustness.
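To make the setup concrete, here is a minimal, hypothetical sketch (not code from the authors' repository) of how a task-critical disagreement could be instantiated as a counterfactual knowledge edit injected into one agent's context, followed by a schematic trace check for whether the collective answer bypasses the edit. `call_llm`, the prompts, and the example fact are illustrative placeholders, not the paper's actual pipeline.

```python
# Minimal sketch, assuming a generic chat-completion backend. Nothing here is the
# authors' implementation; call_llm, the prompts, and the example fact are placeholders.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical LLM call; wire this to whatever chat API you actually use."""
    raise NotImplementedError("plug in a real LLM backend here")

TRUE_FACT = "The Eiffel Tower is located in Paris."
COUNTERFACTUAL_EDIT = "The Eiffel Tower is located in Rome."  # task-critical edit

def build_agents(inject_edit: bool) -> list[str]:
    """Two collaborating agents; optionally one carries the counterfactual edit in context."""
    base = "You are an agent collaborating with a peer on a factual reasoning task."
    edited = f"{base} Treat the following as ground truth: {COUNTERFACTUAL_EDIT}"
    clean = f"{base} Known fact: {TRUE_FACT}"
    return [edited if inject_edit else clean, clean]

def run_round(system_prompts: list[str], question: str) -> list[str]:
    """One collaboration round: each agent answers after seeing earlier peers' answers."""
    answers: list[str] = []
    for system_prompt in system_prompts:
        peer_context = "\n".join(f"Peer answer: {a}" for a in answers)
        answers.append(call_llm(system_prompt, f"{peer_context}\nQuestion: {question}"))
    return answers

def bypassed_edit(final_answer: str) -> bool:
    """Schematic trace check: did the collective answer route around the edited fact?"""
    return "Rome" not in final_answer
```

In the paper's terms, comparing runs with `inject_edit=True` on single-path reasoning questions versus multi-path programming tasks is the kind of contrast that surfaces the solution-path-dependent self-repair behavior described in the abstract.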
Related papers
- DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths [29.11412449913759]
We study multi-agent systems composed of general-purpose large language model (LLM) agents that operate without predefined roles, control flow, or communication constraints. We introduce the Dynamic Interaction Graph (DIG), which captures emergent collaboration as a time-evolving causal network of agent activations and interactions.
arXiv Detail & Related papers (2026-02-27T20:59:37Z)
- CoT-Seg: Rethinking Segmentation with Chain-of-Thought Reasoning and Self-Correction [50.67483317563736]
This paper aims to explore a system that can think step-by-step, look up information if needed, generate results, self-evaluate its own results, and refine the results. We introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction.
arXiv Detail & Related papers (2026-01-24T11:41:54Z)
- DynaDebate: Breaking Homogeneity in Multi-Agent Debate with Dynamic Path Generation [47.62978918069135]
We introduce Dynamic Multi-Agent Debate (DynaDebate), which enhances the effectiveness of multi-agent debate through three key mechanisms. Extensive experiments demonstrate that DynaDebate achieves superior performance across various benchmarks, surpassing existing state-of-the-art MAD methods.
arXiv Detail & Related papers (2026-01-09T12:01:33Z)
- Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
- SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution [29.035097855780858]
SWE-Debate is a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. It organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks.
arXiv Detail & Related papers (2025-07-31T08:54:46Z)
- MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation [4.177310099979434]
Knowledge conflict often arises in RAG systems, where retrieved documents may be inconsistent with one another or contradict the model's parametric knowledge. We propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict.
arXiv Detail & Related papers (2025-07-29T07:19:49Z)
- Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness [50.29739337771454]
Multi-agent debate (MAD) approaches offer improved reasoning, robustness, and diverse perspectives over monolithic models. This paper conceptualizes MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks.
arXiv Detail & Related papers (2025-05-29T01:02:55Z)
- Multi-Agent Collaboration via Evolving Orchestration [55.574417128944226]
Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. We propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs.
arXiv Detail & Related papers (2025-05-26T07:02:17Z)
- Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models [23.37800506729006]
We propose MMKC-Bench, a benchmark to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence.
arXiv Detail & Related papers (2025-05-26T04:39:30Z)
- Why Do Multi-Agent LLM Systems Fail? [91.39266556855513]
We present MAST (Multi-Agent System Failure Taxonomy), the first empirically grounded taxonomy designed to understand MAS failures.
We analyze seven popular MAS frameworks across over 200 tasks, involving six expert human annotators.
We identify 14 unique failure modes, organized into 3 overarching categories: (i) specification issues, (ii) inter-agent misalignment, and (iii) task verification.
arXiv Detail & Related papers (2025-03-17T19:04:38Z)
- KOALA: Knowledge Conflict Augmentations for Robustness in Vision Language Models [6.52323086990482]
segsub is a framework that applies targeted perturbations to image sources to study and improve the robustness of vision language models. Contrary to prior findings, we find VLMs are largely robust to image perturbation. We find a link between hallucinations and image context, with GPT-4o prone to hallucination when presented with highly contextualized counterfactual examples.
arXiv Detail & Related papers (2025-02-19T00:26:38Z)
- Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding [15.828455477224516]
As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities.
In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts.
We propose a novel method called Multimodal Knowledge Consistency Fine-tuning to mitigate the C&P knowledge conflicts.
arXiv Detail & Related papers (2024-11-12T11:28:50Z)
- Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs [55.74117540987519]
This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs).
We introduce an automated pipeline, augmented with human-in-the-loop quality control, to establish a benchmark aimed at simulating and assessing the conflicts in MLLMs.
We evaluate the conflict-resolution capabilities of nine representative MLLMs across various model families and find a noticeable over-reliance on textual queries.
arXiv Detail & Related papers (2024-10-10T17:31:17Z)
- ECon: On the Detection and Resolution of Evidence Conflicts [56.89209046429291]
The rise of large language models (LLMs) has significantly influenced the quality of information in decision-making systems.
This study introduces a method for generating diverse, validated evidence conflicts to simulate real-world misinformation scenarios.
arXiv Detail & Related papers (2024-10-05T07:41:17Z)
- Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, an LLM-based autonomous agent framework.
At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence.
We introduce two new strategies to enhance the performance of AgentCOT.
arXiv Detail & Related papers (2024-09-19T02:20:06Z)
- ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM [36.332500824079844]
Large language models (LLMs) have achieved impressive advancements across numerous disciplines, yet the critical issue of knowledge conflicts has rarely been studied.
We present ConflictBank, the first comprehensive benchmark developed to evaluate knowledge conflicts from three aspects.
Our investigation delves into four model families and twelve LLM instances, meticulously analyzing conflicts stemming from misinformation, temporal discrepancies, and semantic divergences.
arXiv Detail & Related papers (2024-08-22T02:33:13Z)
- Multi-Agent Collaboration via Cross-Team Orchestration [31.506350304184526]
Large Language Models (LLMs) have significantly impacted various domains, especially through organized autonomous agents. We introduce Cross-Team Orchestration (Croto), a scalable multi-team framework that enables orchestrated teams to jointly propose various task-oriented solutions. Experiments reveal a notable increase in software quality compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-06-13T10:18:36Z)
- Towards Rationality in Language and Multimodal Agents: A Survey [23.451887560567602]
This work discusses how to build more rational language and multimodal agents. Rationality is the quality of being guided by reason, characterized by decision-making that aligns with evidence and logical principles.
arXiv Detail & Related papers (2024-06-01T01:17:25Z)
- CoMM: Collaborative Multi-Agent, Multi-Reasoning-Path Prompting for Complex Problem Solving [9.446546965008249]
We propose a collaborative multi-agent, multi-reasoning-path (CoMM) prompting framework.
Specifically, we prompt LLMs to play different roles in a problem-solving team, and encourage different role-play agents to collaboratively solve the target task.
Empirical results demonstrate the effectiveness of the proposed methods on two college-level science problems.
arXiv Detail & Related papers (2024-04-26T23:29:12Z)
- MacGyver: Are Large Language Models Creative Problem Solvers? [87.70522322728581]
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting.
We create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems.
We present our collection to both LLMs and humans to compare and contrast their problem-solving abilities.
arXiv Detail & Related papers (2023-11-16T08:52:27Z)
- Resolving Knowledge Conflicts in Large Language Models [46.903549751371415]
Large language models (LLMs) often encounter knowledge conflicts.
We ask what are the desiderata for LLMs when a knowledge conflict arises and whether existing LLMs fulfill them.
We introduce an evaluation framework for simulating contextual knowledge conflicts.
arXiv Detail & Related papers (2023-10-02T06:57:45Z)
- Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs [55.66353783572259]
Causal-Consistency Chain-of-Thought harnesses multi-agent collaboration to bolster the faithfulness and causality of foundation models. Our framework demonstrates significant superiority over state-of-the-art methods through extensive and comprehensive evaluations.
arXiv Detail & Related papers (2023-08-23T04:59:21Z)