Demystifying Multi-Agent Debate: The Role of Confidence and Diversity
- URL: http://arxiv.org/abs/2601.19921v1
- Date: Fri, 09 Jan 2026 02:38:30 GMT
- Title: Demystifying Multi-Agent Debate: The Role of Confidence and Diversity
- Authors: Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos
- Abstract summary: Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling. Recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. We identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication.
- Score: 31.236476720977294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
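The two interventions described in the abstract can be illustrated with a toy sketch. The Python below is a minimal, hypothetical illustration, not the paper's implementation: the names `diversity_aware_init` and `confidence_weighted_round` are invented here, and the update rule (every agent adopts the answer with the highest total expressed confidence) is a deliberately simplified stand-in for the paper's confidence-modulated protocol.

```python
from collections import defaultdict

def diversity_aware_init(candidates, k):
    """Select up to k distinct candidate answers to seed the debate.
    A crude stand-in for diversity-aware initialisation: deduplicate
    candidates while preserving order, so each hypothesis appears once."""
    seen, pool = set(), []
    for ans in candidates:
        if ans not in seen:
            seen.add(ans)
            pool.append(ans)
        if len(pool) == k:
            break
    return pool

def confidence_weighted_round(beliefs):
    """One simplified debate round: every agent adopts the answer with the
    highest total expressed confidence, keeping its own confidence score.
    `beliefs` is a list of (answer, confidence) pairs, one per agent."""
    support = defaultdict(float)
    for answer, conf in beliefs:
        support[answer] += conf
    best = max(support, key=support.get)
    return [(best, conf) for _, conf in beliefs]

# One confident agent holds "A"; two low-confidence agents hold "B".
beliefs = [("A", 0.9), ("B", 0.4), ("B", 0.3)]
updated = confidence_weighted_round(beliefs)
print(updated[0][0])  # prints "A": total confidence 0.9 beats 0.4 + 0.3
```

Note that with these confidences a plain majority vote would return "B", while the confidence-weighted round converges on "A"; this is the kind of systematic drift toward a well-supported hypothesis that the paper's second intervention is designed to enable.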
Related papers
- Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning [49.99694105650486]
Self-Debate Reinforcement Learning (SDRL) is a training framework that equips a single large language model with strong problem-solving ability. We show that SDRL improves overall Multi-Agent Debate (MAD) performance while simultaneously strengthening single-model reasoning.
arXiv Detail & Related papers (2026-01-29T20:21:44Z)
- Confidence Estimation for LLMs in Multi-turn Interactions [48.081802290688394]
This work presents the first systematic study of confidence estimation in multi-turn interactions. We establish a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
arXiv Detail & Related papers (2026-01-05T14:58:04Z)
- Towards Scalable Oversight with Collaborative Multi-Agent Debate in Error Detection [81.52796950244705]
Self-diagnosis is unreliable on complex tasks unless aided by reliable external feedback. We introduce a new collaborative MAD protocol, termed ColMAD, that reframes MAD as a non-zero-sum game. We show that ColMAD significantly outperforms previous competitive MAD by 19%.
arXiv Detail & Related papers (2025-10-23T19:46:00Z)
- Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate [30.66779902590191]
Large language models (LLMs) often display sycophancy, a tendency toward excessive agreeability. LLMs' inherent sycophancy can collapse debates into premature consensus.
arXiv Detail & Related papers (2025-09-27T02:27:13Z)
- Enhancing Multi-Agent Debate System Performance via Confidence Expression [55.34012400580016]
Multi-Agent Debate (MAD) systems simulate human debate and thereby improve task performance. Some Large Language Models (LLMs) possess superior knowledge or reasoning capabilities for specific tasks, but struggle to clearly communicate this advantage during debates. Inappropriate confidence expression can cause agents in MAD systems to either stubbornly maintain incorrect beliefs or converge prematurely on suboptimal answers. We develop ConfMAD, a MAD framework that integrates confidence expression throughout the debate process.
arXiv Detail & Related papers (2025-09-17T14:34:27Z)
- Free-MAD: Consensus-Free Multi-Agent Debate [17.384699873512464]
Multi-agent debate (MAD) is an emerging approach to improving the reasoning capabilities of large language models (LLMs). Existing MAD methods rely on multiple rounds of interaction among agents to reach consensus, and the final output is selected by majority voting in the last round. We propose Free-MAD, a novel MAD framework that eliminates the need for consensus among agents.
arXiv Detail & Related papers (2025-09-14T01:55:01Z)
- Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models? [13.569822165805851]
Multi-Agent Debate (MAD) has emerged as a promising paradigm for improving the performance of large language models. Despite recent advances, the key factors driving MAD's effectiveness remain unclear. We disentangle MAD into two key components, Majority Voting and inter-agent Debate, and assess their respective contributions.
arXiv Detail & Related papers (2025-08-24T22:14:32Z)
- Confident-Knowledge Diversity Drives Human-Human and Human-AI Free Discussion Synergy and Reveals Pure-AI Discussion Shortfalls [3.335241944417891]
We study whether large language models can replicate the synergistic gains observed in human discussion. We introduce an agent-agnostic confident-knowledge framework that models each participant by performance (accuracy) and confidence. This framework quantifies confident-knowledge diversity, the degree to which one agent tends to be correct when another is uncertain, and yields a conservative upper bound on gains achievable via confidence-informed decisions.
arXiv Detail & Related papers (2025-06-15T05:09:20Z)
- Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness [50.29739337771454]
Multi-agent debate (MAD) approaches offer improved reasoning, robustness, and diverse perspectives over monolithic models. This paper conceptualizes MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks.
arXiv Detail & Related papers (2025-05-29T01:02:55Z)
- Stop Overvaluing Multi-Agent Debate -- We Must Rethink Evaluation and Embrace Model Heterogeneity [20.408720462383158]
Multi-agent debate (MAD) has gained significant attention as a promising line of research to improve the factual accuracy and reasoning capabilities of large language models (LLMs). Despite its conceptual appeal, current MAD research suffers from critical limitations in evaluation practices. This paper presents a systematic evaluation of 5 representative MAD methods across 9 benchmarks using 4 foundational models.
arXiv Detail & Related papers (2025-02-12T21:01:10Z)
- DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics [52.242449026151846]
Multi-agent debates have been introduced to improve the accuracy of Large Language Models (LLMs). We propose DebUnc, a debate framework that uses uncertainty metrics to assess agent confidence.
arXiv Detail & Related papers (2024-07-08T22:15:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.