Related papers: Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate

URL: http://arxiv.org/abs/2509.05396v2
Date: Mon, 13 Oct 2025 16:40:01 GMT
Title: Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate
Authors: Andrea Wynn, Harsh Satija, Gillian Hadfield
Abstract summary: We show that debate can lead to a decrease in accuracy over time. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning. These results highlight important failure modes in the exchange of reasons during multi-agent debate.
Score: 2.3027211055417283
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While multi-agent debate has been proposed as a promising strategy for improving AI reasoning ability, we find that debate can sometimes be harmful rather than helpful. Prior work has primarily focused on debates within homogeneous groups of agents, whereas we explore how diversity in model capabilities influences the dynamics and outcomes of multi-agent interactions. Through a series of experiments, we demonstrate that debate can lead to a decrease in accuracy over time - even in settings where stronger (i.e., more capable) models outnumber their weaker counterparts. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. We perform additional experiments investigating various potential contributing factors to these harmful shifts - including sycophancy, social conformity, and model and task type. These results highlight important failure modes in the exchange of reasons during multi-agent debate, suggesting that naive applications of debate may cause performance degradation when agents are neither incentivised nor adequately equipped to resist persuasive but incorrect reasoning.

Related papers

DynaDebate: Breaking Homogeneity in Multi-Agent Debate with Dynamic Path Generation [47.62978918069135]
We introduce Dynamic Multi-Agent Debate (DynaDebate), which enhances the effectiveness of multi-agent debate through three key mechanisms. Extensive experiments demonstrate that DynaDebate achieves superior performance across various benchmarks, surpassing existing state-of-the-art MAD methods.
arXiv Detail & Related papers (2026-01-09T12:01:33Z)
CRAwDAD: Causal Reasoning Augmentation with Dual-Agent Debate [3.2852123901391077]
We develop a dual-agent debate framework for causal inference. Agents attempt to persuade each other, challenging each other's logic. We show that strong models can still benefit greatly from debate with weaker agents.
arXiv Detail & Related papers (2025-11-28T03:19:35Z)
Real-Time Reasoning Agents in Evolving Environments [52.21796134114843]
We introduce real-time reasoning as a new problem formulation for agents in evolving environments. Our work establishes real-time reasoning as a critical testbed for developing practical agents.
arXiv Detail & Related papers (2025-11-07T00:51:02Z)
The Hunger Game Debate: On the Emergence of Over-Competition in Multi-Agent Systems [90.96738882568224]
This paper investigates over-competition in multi-agent debate, where agents under extreme pressure exhibit unreliable, harmful behaviors. To study this phenomenon, we propose HATE, a novel experimental framework that simulates debates in a zero-sum competition arena.
arXiv Detail & Related papers (2025-09-30T11:44:47Z)
Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate [30.66779902590191]
Large language models (LLMs) often display sycophancy, a tendency toward excessive agreeability. LLMs' inherent sycophancy can collapse debates into premature consensus.
arXiv Detail & Related papers (2025-09-27T02:27:13Z)
Disagreements in Reasoning: How a Model's Thinking Process Dictates Persuasion in Multi-Agent Systems [49.69773210844221]
This paper challenges the prevailing hypothesis that persuasive efficacy is primarily a function of model scale. Through a series of multi-agent persuasion experiments, we uncover a fundamental trade-off we term the Persuasion Duality. Our findings reveal that the reasoning process in LRMs exhibits significantly greater resistance to persuasion, maintaining their initial beliefs more robustly.
arXiv Detail & Related papers (2025-09-25T12:03:10Z)
MV-Debate: Multi-view Agent Debate with Dynamic Reflection Gating for Multimodal Harmful Content Detection in Social Media [26.07883439550861]
MV-Debate is a multi-view agent debate framework with dynamic reflection gating for unified multimodal harmful content detection. MV-Debate assembles four complementary debate agents, a surface analyst, a deep reasoner, a modality contrast, and a social contextualist, to analyze content from diverse interpretive perspectives.
arXiv Detail & Related papers (2025-08-07T16:38:25Z)
Debating for Better Reasoning: An Unsupervised Multimodal Approach [56.74157117060815]
We extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two "sighted" expert vision-language models debate an answer, while a "blind" (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement.
arXiv Detail & Related papers (2025-05-20T17:18:17Z)
Teaching Models to Balance Resisting and Accepting Persuasion [69.68379406317682]
We show that Persuasion-Balanced Training (or PBT) can balance positive and negative persuasion. PBT allows us to use data generated from dialogues between smaller 7-8B models for training much larger 70B models. We find that PBT leads to better and more stable results and less order dependence.
arXiv Detail & Related papers (2024-10-18T16:49:36Z)
GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion [8.948702488582583]
This paper proposes a method to significantly reduce token cost in multi-agent debates. Our method significantly enhances the performance and efficiency of interactions in the multi-agent debate.
arXiv Detail & Related papers (2024-09-21T07:49:38Z)
DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics [52.242449026151846]
Multi-agent debates have been introduced to improve the accuracy of Large Language Models (LLMs). We propose DebUnc, a debate framework that uses uncertainty metrics to assess agent confidence.
arXiv Detail & Related papers (2024-07-08T22:15:01Z)
Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs [55.66353783572259]
Causal-Consistency Chain-of-Thought harnesses multi-agent collaboration to bolster the faithfulness and causality of foundation models. Our framework demonstrates significant superiority over state-of-the-art methods through extensive and comprehensive evaluations.
arXiv Detail & Related papers (2023-08-23T04:59:21Z)
Neural Amortized Inference for Nested Multi-agent Reasoning [54.39127942041582]
We propose a novel approach to bridge the gap between human-like inference capabilities and computational limitations. We evaluate our method in two challenging multi-agent interaction domains.
arXiv Detail & Related papers (2023-08-21T22:40:36Z)
What Changed Your Mind: The Roles of Dynamic Topics and Discourse in Argumentation Process [78.4766663287415]
This paper presents a study that automatically analyzes the key factors in argument persuasiveness. We propose a novel neural model that is able to track the changes of latent topics and discourse in argumentative conversations.
arXiv Detail & Related papers (2020-02-10T04:27:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.