Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs
- URL: http://arxiv.org/abs/2311.17371v3
- Date: Thu, 18 Jul 2024 05:18:14 GMT
- Title: Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs
- Authors: Andries Smit, Paul Duckworth, Nathan Grinsztajn, Thomas D. Barrett, Arnu Pretorius
- Abstract summary: We benchmark a range of debating and prompting strategies to explore the trade-offs between cost, time, and accuracy.
We find that multi-agent debating systems, in their current form, do not reliably outperform other proposed prompting strategies.
We build on these results to offer insights into improving debating strategies, such as adjusting agent agreement levels.
- Score: 7.7433783185451075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in large language models (LLMs) underscore their potential for responding to inquiries in various domains. However, ensuring that generative agents provide accurate and reliable answers remains an ongoing challenge. In this context, multi-agent debate (MAD) has emerged as a promising strategy for enhancing the truthfulness of LLMs. We benchmark a range of debating and prompting strategies to explore the trade-offs between cost, time, and accuracy. Importantly, we find that multi-agent debating systems, in their current form, do not reliably outperform other proposed prompting strategies, such as self-consistency and ensembling using multiple reasoning paths. However, when performing hyperparameter tuning, several MAD systems, such as Multi-Persona, perform better. This suggests that MAD protocols might not be inherently worse than other approaches, but that they are more sensitive to different hyperparameter settings and difficult to optimize. We build on these results to offer insights into improving debating strategies, such as adjusting agent agreement levels, which can significantly enhance performance and even surpass all other non-debate protocols we evaluated. We provide an open-source repository to the community with several state-of-the-art protocols together with evaluation scripts to benchmark across popular research datasets.
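For illustration, here is a minimal sketch (not the paper's implementation) contrasting self-consistency voting with a debate loop that stops once agent agreement crosses a threshold; the `llm` callable, the prompt wording, and the `agreement` parameter are assumed placeholders:

```python
from collections import Counter
from typing import Callable

def self_consistency(llm: Callable[[str], str], question: str, n: int = 5) -> str:
    """Sample n independent answers and return the majority vote."""
    answers = [llm(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def debate(llm: Callable[[str], str], question: str, n_agents: int = 3,
           agreement: float = 0.66, max_rounds: int = 3) -> str:
    """Agents answer, then revise after seeing peers, until agreement is reached."""
    answers = [llm(question) for _ in range(n_agents)]
    for _ in range(max_rounds):
        top, count = Counter(answers).most_common(1)[0]
        if count / n_agents >= agreement:  # agreement threshold reached: stop early
            return top
        peers = "\n".join(f"Agent {i}: {a}" for i, a in enumerate(answers))
        answers = [llm(f"{question}\nOther agents answered:\n{peers}\n"
                       "Reconsider and state your final answer.")
                   for _ in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]
```

Tuning the `agreement` threshold is one way to realize the agreement-level adjustment the abstract highlights.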
Related papers
- MAD-Spear: A Conformity-Driven Prompt Injection Attack on Multi-Agent Debate Systems [4.931691794637798]
Multi-agent debate (MAD) systems leverage collaborative interactions among large language models (LLMs) to improve reasoning capabilities.
We introduce MAD-Spear, a targeted prompt injection attack that compromises a small subset of agents but significantly disrupts the overall MAD process.
arXiv Detail & Related papers (2025-07-17T12:09:39Z)
- Multi-Agent Debate Strategies to Enhance Requirements Engineering with Large Language Models [3.4829662575293585]
Large Language Model (LLM) agents are becoming widely used for various Requirements Engineering (RE) tasks.
Research on improving their accuracy mainly focuses on prompt engineering, model fine-tuning, and retrieval augmented generation.
We propose that just as human debates enhance accuracy and reduce bias in RE tasks by incorporating diverse perspectives, different LLM agents debating and collaborating may achieve similar improvements.
arXiv Detail & Related papers (2025-07-08T13:37:59Z)
- CortexDebate: Debating Sparsely and Equally for Multi-Agent Debate [11.155092859033784]
Multi-Agent Debate (MAD) has emerged as an effective strategy to mitigate issues with a single Large Language Model (LLM).
Existing MAD methods face two major issues: (a) overly long input contexts, which cause LLM agents to get lost in the abundance of input information and suffer performance drops; and (b) the overconfidence dilemma, where self-assured LLM agents dominate the debate, leading to low debating effectiveness.
We propose a novel MAD method called "CortexDebate", inspired by the human brain's tendency to establish a sparse and dynamically optimized network among cortical areas governed by white matter.
arXiv Detail & Related papers (2025-07-05T07:23:15Z)
- Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness [50.29739337771454]
Multi-agent debate (MAD) approaches offer improved reasoning, robustness, and diverse perspectives over monolithic models.
This paper conceptualizes MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities.
We conduct a comprehensive empirical investigation comparing MAD with strong single-agent test-time scaling baselines on mathematical reasoning and safety-related tasks.
arXiv Detail & Related papers (2025-05-29T01:02:55Z)
- Debate Only When Necessary: Adaptive Multiagent Collaboration for Efficient LLM Reasoning [8.800516398660069]
Multiagent collaboration has emerged as a promising framework for enhancing the reasoning capabilities of large language models (LLMs).
We propose Debate Only When Necessary (DOWN), an adaptive multiagent debate framework that selectively activates the debate process based on the confidence score of the agent's initial response.
DOWN significantly improves efficiency while maintaining or even surpassing the performance of existing multiagent debate systems.
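A rough illustration of this confidence-gated idea (a simplification, not the DOWN implementation; the self-rating prompt and its parsing are assumptions):

```python
from typing import Callable, Tuple

def answer_with_confidence(llm: Callable[[str], str],
                           question: str) -> Tuple[str, float]:
    """Hypothetical helper: ask the model to self-rate its confidence in [0, 1]."""
    reply = llm(f"{question}\nAnswer, then on a new line give a confidence "
                "score between 0 and 1.")
    parts = reply.strip().rsplit("\n", 1)
    try:
        return parts[0], float(parts[1])
    except (IndexError, ValueError):
        return reply, 0.0  # unparsable confidence: treat the answer as uncertain

def debate_when_necessary(llm: Callable[[str], str], question: str,
                          debate: Callable[[str], str],
                          threshold: float = 0.8) -> str:
    """Run the costly debate only if the initial answer is not confident enough."""
    answer, confidence = answer_with_confidence(llm, question)
    if confidence >= threshold:
        return answer  # confident enough: skip the debate entirely
    return debate(question)
```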
arXiv Detail & Related papers (2025-04-07T13:17:52Z)
- Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.
We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
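One plausible reading of this verifier-guided loop, sketched with assumed `generate`, `verify`, and `refine` callables rather than the authors' actual interfaces:

```python
from typing import Callable, List

def iterative_agent_decode(generate: Callable[[str], List[str]],
                           verify: Callable[[str, str], float],
                           refine: Callable[[str, str], str],
                           task: str, rounds: int = 3) -> str:
    """Generate candidates, score them with a verifier, refine the best one."""
    best, best_score = "", float("-inf")
    for step in range(rounds):
        candidates = generate(task)
        if step > 0:
            candidates.append(refine(task, best))  # refine the current best
        for cand in candidates:
            score = verify(task, cand)  # verifier-assigned quality score
            if score > best_score:
                best, best_score = cand, score
    return best
```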
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
- Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute [55.330813919992465]
This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute.
Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths.
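The repeated-sampling-then-voting scheme with several models might look roughly like this (a sketch; the model set and normalization of answers are left abstract):

```python
from collections import Counter
from typing import Callable, List

def multi_model_vote(models: List[Callable[[str], str]], question: str,
                     samples_per_model: int = 3) -> str:
    """Pool repeated samples from several (possibly weak) models, then majority-vote."""
    votes = Counter()
    for model in models:
        for _ in range(samples_per_model):
            votes[model(question)] += 1
    return votes.most_common(1)[0][0]
```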
arXiv Detail & Related papers (2025-04-01T13:13:43Z)
- Contradiction Detection in RAG Systems: Evaluating LLMs as Context Validators for Improved Information Consistency [0.6827423171182154]
Retrieval Augmented Generation (RAG) systems have emerged as a powerful method for enhancing large language models (LLMs) with up-to-date information.
RAG can sometimes surface documents containing contradictory information, particularly in rapidly evolving domains such as news.
This study presents a novel data generation framework to simulate different types of contradictions that may occur in the retrieval stage of a RAG system.
arXiv Detail & Related papers (2025-03-31T19:41:15Z)
- If Multi-Agent Debate is the Answer, What is the Question? [19.246022410492692]
Multi-agent debate (MAD) has emerged as a promising approach to enhance the factual accuracy and reasoning quality of large language models.
Despite its potential, MAD research suffers from critical shortcomings in evaluation practices.
This paper presents a systematic evaluation of five representative MAD methods across nine benchmarks.
arXiv Detail & Related papers (2025-02-12T21:01:10Z)
- Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, an LLM-based autonomous agent framework.
At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence.
We introduce two new strategies to enhance the performance of AgentCOT.
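That stepwise loop could be sketched as follows, with `select_action`, `execute`, and the `FINISH` sentinel all assumed placeholders rather than AgentCOT's actual interface:

```python
from typing import Callable, List, Tuple

def agent_loop(select_action: Callable[[str, List[str]], str],
               execute: Callable[[str], Tuple[str, str]],
               task: str, max_steps: int = 8) -> List[str]:
    """Repeatedly select an action, execute it, and record the result with evidence."""
    trajectory: List[str] = []
    for _ in range(max_steps):
        action = select_action(task, trajectory)
        if action == "FINISH":  # assumed termination sentinel
            break
        result, evidence = execute(action)  # intermediate result + supporting evidence
        trajectory.append(f"{action} -> {result} [evidence: {evidence}]")
    return trajectory
```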
arXiv Detail & Related papers (2024-09-19T02:20:06Z)
- DebUnc: Mitigating Hallucinations in Large Language Model Agent Communication with Uncertainty Estimations [52.242449026151846]
DebUnc is a multi-agent debate framework that uses uncertainty metrics to assess agent confidence levels.
We adapted the attention mechanism to adjust token weights based on confidence levels.
Our evaluations show that attention-based methods are particularly effective.
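The attention-level reweighting requires access to model internals, but a simplified prompt-level analogue, weighting each agent's vote by its estimated confidence, can be sketched as follows (an illustration, not DebUnc's method):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def confidence_weighted_vote(agent_outputs: List[Tuple[str, float]]) -> str:
    """Aggregate (answer, confidence) pairs; low-confidence agents count for less."""
    weights: Dict[str, float] = defaultdict(float)
    for answer, confidence in agent_outputs:
        weights[answer] += confidence
    return max(weights, key=weights.get)
```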
arXiv Detail & Related papers (2024-07-08T22:15:01Z)
- On Speeding Up Language Model Evaluation [48.51924035873411]
Development of prompt-based methods with Large Language Models (LLMs) requires making numerous decisions.
We propose a novel method to address this challenge.
We show that it can identify the top-performing method using only 5-15% of the typically needed resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z)
- Improving Multi-Agent Debate with Sparse Communication Topology [9.041025703879905]
Multi-agent debate has proven effective in improving the quality of large language models on reasoning and factuality tasks.
In this paper, we investigate the effect of communication connectivity in multi-agent systems.
Our experiments on GPT and Mistral models reveal that multi-agent debates leveraging sparse communication topology can achieve comparable or superior performance.
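The core idea, letting each agent see only a subset of its peers, can be sketched with an explicit neighbor map (illustrative only; the paper's topologies and prompts may differ):

```python
from typing import Callable, Dict, List

def sparse_debate_round(llm: Callable[[str], str], question: str,
                        answers: List[str],
                        neighbors: Dict[int, List[int]]) -> List[str]:
    """One debate round in which agent i sees only its neighbors' answers."""
    revised = []
    for i in range(len(answers)):
        visible = "\n".join(f"Agent {j}: {answers[j]}" for j in neighbors[i])
        revised.append(llm(f"{question}\nPeer answers:\n{visible}\n"
                           "Revise your answer if warranted."))
    return revised

# A ring over 4 agents (each sees one peer) instead of a fully connected graph.
ring_topology = {0: [1], 1: [2], 2: [3], 3: [0]}
```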
arXiv Detail & Related papers (2024-06-17T17:33:09Z)
- Large Multimodal Agents: A Survey [78.81459893884737]
Large language models (LLMs) have achieved superior performance in powering text-based AI agents.
There is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain.
This review aims to provide valuable insights and guidelines for future research in this rapidly evolving field.
arXiv Detail & Related papers (2024-02-23T06:04:23Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework in which multiple agents argue in a "tit for tat" fashion and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
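A minimal sketch of such a judge-managed, two-sided debate, with all three callables and the CONTINUE convention as assumptions rather than the authors' protocol:

```python
from typing import Callable

def judged_debate(affirmative: Callable[[str], str],
                  negative: Callable[[str], str],
                  judge: Callable[[str], str],
                  question: str, max_rounds: int = 3) -> str:
    """Two opposing agents argue in turns; a judge decides when the debate ends."""
    transcript = f"Question: {question}"
    for _ in range(max_rounds):
        transcript += f"\nAffirmative: {affirmative(transcript)}"
        transcript += f"\nNegative: {negative(transcript)}"
        verdict = judge(transcript + "\nIf the debate is settled, state the final "
                        "answer; otherwise reply CONTINUE.")
        if verdict.strip() != "CONTINUE":
            return verdict
    return judge(transcript + "\nState the final answer now.")
```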
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.