M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?
- URL: http://arxiv.org/abs/2601.02854v1
- Date: Tue, 06 Jan 2026 09:33:48 GMT
- Title: M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?
- Authors: Ao Li, Jinghui Zhang, Luyu Li, Yuxiang Duan, Lang Gao, Mingcai Chen, Weijun Qin, Shaopeng Li, Fengxian Ji, Ning Liu, Lizhen Cui, Xiuying Chen, Yuntao Du,
- Abstract summary: Multi-Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning. Existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison, and are largely restricted to single-modality scenarios that rely on textual inputs only. We introduce M3MAD-Bench, a unified benchmark for evaluating MAD methods across Multi-domain tasks, Multi-modal inputs, and Multi-dimensional metrics.
- Score: 37.902089112579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As an agent-level reasoning and coordination paradigm, Multi-Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning. However, existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison, and are largely restricted to single-modality scenarios that rely on textual inputs only. To address these gaps, we introduce M3MAD-Bench, a unified and extensible benchmark for evaluating MAD methods across Multi-domain tasks, Multi-modal inputs, and Multi-dimensional metrics. M3MAD-Bench establishes standardized protocols over five core task domains: Knowledge, Mathematics, Medicine, Natural Sciences, and Complex Reasoning, and systematically covers both pure text and vision-language datasets, enabling controlled cross-modality comparison. We evaluate MAD methods on nine base models spanning different architectures, scales, and modality capabilities. Beyond accuracy, M3MAD-Bench incorporates efficiency-oriented metrics such as token consumption and inference time, providing a holistic view of performance-cost trade-offs. Extensive experiments yield systematic insights into the effectiveness, robustness, and efficiency of MAD across text-only and multimodal scenarios. We believe M3MAD-Bench offers a reliable foundation for future research on standardized MAD evaluation. The code is available at http://github.com/liaolea/M3MAD-Bench.
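The abstract names three kinds of measurement: accuracy, token consumption, and inference time. As a rough illustration of how a debate run could expose the two cost figures alongside its final answer, here is a minimal sketch. It is not the M3MAD-Bench implementation: `run_debate`, `make_stub_agent`, `DebateResult`, the majority-vote aggregation, and the whitespace-based token count are all assumptions made for this example, and the stub agents stand in for real model calls.

```python
import time
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical agent interface: (question, peer_answers) -> (answer, tokens_used).
Agent = Callable[[str, List[str]], Tuple[str, int]]


@dataclass
class DebateResult:
    answer: str        # final answer chosen by majority vote
    total_tokens: int  # tokens summed over every agent call
    seconds: float     # wall-clock time for the whole debate


def make_stub_agent(initial: str) -> Agent:
    """Toy agent: keeps its answer unless a strict majority of peers disagrees."""
    current = initial

    def agent(question: str, peer_answers: List[str]) -> Tuple[str, int]:
        nonlocal current
        if peer_answers:
            top, count = Counter(peer_answers).most_common(1)[0]
            if count > len(peer_answers) / 2:
                current = top
        # Assumed cost model: whitespace-separated words read, plus one word written.
        tokens = len(question.split()) + sum(len(p.split()) for p in peer_answers) + 1
        return current, tokens

    return agent


def run_debate(question: str, agents: List[Agent], rounds: int = 2) -> DebateResult:
    start = time.perf_counter()
    total_tokens = 0

    # Round 0: every agent answers independently.
    answers = []
    for agent in agents:
        answer, tokens = agent(question, [])
        answers.append(answer)
        total_tokens += tokens

    # Debate rounds: each agent sees the other agents' previous answers.
    for _ in range(rounds):
        updated = []
        for i, agent in enumerate(agents):
            peers = answers[:i] + answers[i + 1:]
            answer, tokens = agent(question, peers)
            updated.append(answer)
            total_tokens += tokens
        answers = updated

    final_answer = Counter(answers).most_common(1)[0][0]
    return DebateResult(final_answer, total_tokens, time.perf_counter() - start)


if __name__ == "__main__":
    stubs = [make_stub_agent("4"), make_stub_agent("4"), make_stub_agent("5")]
    print(run_debate("What is 2 + 2?", stubs, rounds=1))
```

Returning the answer together with its token and time cost keeps the accuracy and efficiency views of a single run in one record, which is the kind of performance-cost comparison the abstract describes.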
Related papers
- Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval [10.62333858188658]
Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue. Existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. We propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool.
arXiv Detail & Related papers (2026-01-08T09:07:41Z)
- iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference [11.86992814928132]
Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple agents in structured debates. We propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial. We show that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).
arXiv Detail & Related papers (2025-11-14T13:50:51Z)
- MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction [52.89860691282002]
Implicit Attribute Value Extraction (AVE) is essential for accurately representing products in e-commerce. Despite advances in multimodal large language models (MLLMs), implicit AVE remains challenging due to the complexity of multidimensional data. We introduce MADIAVE, a multi-agent debate framework that employs multiple MLLM agents to iteratively refine inferences.
arXiv Detail & Related papers (2025-10-07T06:27:42Z)
- UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
- MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. We introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z)
- MARS: toward more efficient multi-agent collaboration for LLM reasoning [12.889395413072696]
Multi-Agent Review System (MARS) is a role-based collaboration framework inspired by the review process. We show that MARS matches the accuracy of Multi-Agent Debate (MAD) while reducing both token usage and inference time by approximately 50%.
arXiv Detail & Related papers (2025-09-24T19:24:33Z)
- MALLM: Multi-Agent Large Language Models Framework [11.142842314744586]
Multi-agent debate (MAD) has demonstrated the ability to augment collective intelligence by scaling test-time compute and leveraging expertise. We introduce MALLM, an open-source framework that enables systematic analysis of MAD components.
arXiv Detail & Related papers (2025-09-15T07:48:02Z)
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents [78.3863007028688]
MM-BrowseComp is a novel benchmark comprising 224 challenging, hand-crafted questions. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy.
arXiv Detail & Related papers (2025-08-14T13:46:47Z)
- Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness [50.29739337771454]
Multi-agent debate (MAD) approaches offer improved reasoning, robustness, and diverse perspectives over monolithic models. This paper conceptualizes MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks.
arXiv Detail & Related papers (2025-05-29T01:02:55Z)
- Stop Overvaluing Multi-Agent Debate -- We Must Rethink Evaluation and Embrace Model Heterogeneity [20.408720462383158]
Multi-agent debate (MAD) has gained significant attention as a promising line of research to improve the factual accuracy and reasoning capabilities of large language models (LLMs). Despite its conceptual appeal, current MAD research suffers from critical limitations in evaluation practices. This paper presents a systematic evaluation of 5 representative MAD methods across 9 benchmarks using 4 foundational models.
arXiv Detail & Related papers (2025-02-12T21:01:10Z)
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z)
- VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
Multimodal misinformation is a growing problem on social media platforms. In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks. We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)