Related papers: M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?

M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?

URL: http://arxiv.org/abs/2601.02854v1
Date: Tue, 06 Jan 2026 09:33:48 GMT
Title: M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?
Authors: Ao Li, Jinghui Zhang, Luyu Li, Yuxiang Duan, Lang Gao, Mingcai Chen, Weijun Qin, Shaopeng Li, Fengxian Ji, Ning Liu, Lizhen Cui, Xiuying Chen, Yuntao Du,
Abstract summary: Multi-Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning.<n>Existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison.<n>We introduce M3MAD-Bench, a unified benchmark for evaluating MAD methods across Multi-domain tasks, Multi-modal inputs, and Multi-dimensional metrics.
Score: 37.902089112579
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As an agent-level reasoning and coordination paradigm, Multi-Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning. However, existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison, and are largely restricted to single-modality scenarios that rely on textual inputs only. To address these gaps, we introduce M3MAD-Bench, a unified and extensible benchmark for evaluating MAD methods across Multi-domain tasks, Multi-modal inputs, and Multi-dimensional metrics. M3MAD-Bench establishes standardized protocols over five core task domains: Knowledge, Mathematics, Medicine, Natural Sciences, and Complex Reasoning, and systematically covers both pure text and vision-language datasets, enabling controlled cross-modality comparison. We evaluate MAD methods on nine base models spanning different architectures, scales, and modality capabilities. Beyond accuracy, M3MAD-Bench incorporates efficiency-oriented metrics such as token consumption and inference time, providing a holistic view of performance--cost trade-offs. Extensive experiments yield systematic insights into the effectiveness, robustness, and efficiency of MAD across text-only and multimodal scenarios. We believe M3MAD-Bench offers a reliable foundation for future research on standardized MAD evaluation. The code is available at http://github.com/liaolea/M3MAD-Bench.

Related papers

Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval [10.62333858188658]
Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue.<n>Existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations.<n>We propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool.
arXiv Detail & Related papers (2026-01-08T09:07:41Z)
iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference [11.86992814928132]
Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple agents in structured debates.<n>We propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial.<n>We show that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%)
arXiv Detail & Related papers (2025-11-14T13:50:51Z)
MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction [52.89860691282002]
Implicit Attribute Value Extraction (AVE) is essential for accurately representing products in e-commerce.<n>Despite advances in multimodal large language models (MLLMs), implicit AVE remains challenging due to the complexity of multidimensional data.<n>We introduce textscmodelname, a multi-agent debate framework that employs multiple MLLM agents to iteratively refine inferences.
arXiv Detail & Related papers (2025-10-07T06:27:42Z)
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases.<n>UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages.<n>Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios.<n>We introduce MR$2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z)
MARS: toward more efficient multi-agent collaboration for LLM reasoning [12.889395413072696]
Multi-Agent Review System (MARS) is a role-based collaboration framework inspired by the review process.<n>We show that MARS matches the accuracy of Multi-Agent Debate (MAD) while reducing both token usage and inference time by approximately 50%.
arXiv Detail & Related papers (2025-09-24T19:24:33Z)
MALLM: Multi-Agent Large Language Models Framework [11.142842314744586]
Multi-agent debate (MAD) has demonstrated the ability to augment collective intelligence by scaling test-time compute and leveraging expertise.<n>We introduce MALLM, an open-source framework that enables systematic analysis of MAD components.
arXiv Detail & Related papers (2025-09-15T07:48:02Z)
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents [78.3863007028688]
MM-BrowseComp is a novel benchmark comprising 224 challenging, hand-crafted questions.<n>These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages.<n>Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy.
arXiv Detail & Related papers (2025-08-14T13:46:47Z)
Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness [50.29739337771454]
Multi-agent debate (MAD) approaches offer improved reasoning, robustness, and diverse perspectives over monolithic models.<n>This paper conceptualizes MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities.<n>We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks.
arXiv Detail & Related papers (2025-05-29T01:02:55Z)
Stop Overvaluing Multi-Agent Debate -- We Must Rethink Evaluation and Embrace Model Heterogeneity [20.408720462383158]
Multi-agent debate (MAD) has gained significant attention as a promising line of research to improve the factual accuracy and reasoning capabilities of large language models (LLMs)<n>Despite its conceptual appeal, current MAD research suffers from critical limitations in evaluation practices.<n>This paper presents a systematic evaluation of 5 representative MAD methods across 9 benchmarks using 4 foundational models.
arXiv Detail & Related papers (2025-02-12T21:01:10Z)
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding.<n>EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality.<n>Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z)
VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
multimodal misinformation is a growing problem on social media platforms. In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks. We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.