Related papers: Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks

Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks

URL: http://arxiv.org/abs/2410.12853v1
Date: Thu, 10 Oct 2024 21:59:01 GMT
Title: Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks
Authors: Mahmood Hegazy,
Abstract summary: Chain-of-thought prompting, self-verification, and multi-agent debate are proposed to improve the reasoning and factual accuracy of large language models. We find that multi-agent debate helps at any model scale, and that diversity of thought elicits stronger reasoning in debating LLMs.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language models (LLMs) excel in natural language generation but often confidently produce incorrect responses, especially in tasks like mathematical reasoning. Chain-of-thought prompting, self-verification, and multi-agent debate are among the strategies proposed to improve the reasoning and factual accuracy of LLMs. Building on Du et al.'s multi-agent debate framework, we find that multi-agent debate helps at any model scale, and that diversity of thought elicits stronger reasoning in debating LLMs. Across various model sizes, performance on mathematical reasoning tasks benefits most when diverse trained models are used. Remarkably, after 4 rounds of debate, a diverse set of medium-capacity models (Gemini-Pro, Mixtral 7BX8, and PaLM 2-M) outperforms GPT-4 on the GSM-8K benchmark, scoring 91% accuracy. By comparison, when 3 instances of Gemini-Pro are used, performance only reaches 82%. Finally, this diverse set of medium-capacity models sets a new state-of-the-art performance on the ASDiv benchmark (94%). These results underscore the idea that the future of AI is agentic, with diverse cooperating agents yielding emergent capabilities beyond even the most powerful individual models.

Related papers

Multiple LLM Agents Debate for Equitable Cultural Alignment [39.974611538629304]
We introduce a Multi-Agent Debate framework, where two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision.<n>We evaluate these approaches on 7 open-weight LLMs (and 21 LLM combinations) using the NormAd-ETI benchmark for social etiquette norms in 75 countries.<n>Experiments show that debate improves both overall accuracy and cultural group parity over single-LLM baselines.
arXiv Detail & Related papers (2025-05-30T15:01:52Z)
Multiple Weaks Win Single Strong: Large Language Models Ensemble Weak Reinforcement Learning Agents into a Supreme One [28.264011412168347]
Model ensemble is a useful approach in reinforcement learning (RL) for training effective agents.<n>We propose LLM-Ens, a novel approach that enhances RL model ensemble with task-specific semantic understandings.
arXiv Detail & Related papers (2025-05-21T09:35:43Z)
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [76.10639521319382]
We propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. We show that Symbolic-MoE's instance-level expert selection improves performance by a large margin but -- when implemented naively -- can introduce a high computational overhead.
arXiv Detail & Related papers (2025-03-07T18:03:13Z)
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective [90.86370957353911]
Chain-of-Reasoning (CoR) is a novel unified framework that integrates multiple reasoning paradigms. CoR generates multiple potential answers using different reasoning paradigms and synthesizes them into a coherent final solution. Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models.
arXiv Detail & Related papers (2025-01-19T16:53:26Z)
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization process to enhance the multimodal reasoning capabilities of MLLMs. We develop a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate [24.92465108034783]
Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents. We evaluate the behavior of a network of models collaborating through debate under the influence of an adversary.
arXiv Detail & Related papers (2024-06-20T20:09:37Z)
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship. We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3. While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
Combating Adversarial Attacks with Multi-Agent Debate [4.450536872346658]
We implement multi-agent debate between current state-of-the-art language models and evaluate models' susceptibility to red team attacks. We find that multi-agent debate can reduce model toxicity when jailbroken or less capable models are forced to debate with non-jailbroken or more capable models.
arXiv Detail & Related papers (2024-01-11T15:57:38Z)
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs) This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation. By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
Mixed Distillation Helps Smaller Language Model Better Reasoning [27.934081882868902]
We introduce Mixed Distillation (MD) framework, which capitalizes on the strengths of Program of Thought (PoT) and Chain of Thought (CoT) capabilities within large language models (LLMs) Our experimental results show that MD significantly enhances the single-path and multi-path reasoning ability of smaller models in various tasks.
arXiv Detail & Related papers (2023-12-17T14:28:28Z)
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing. As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework. This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning. In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity. We propose an alternating training strategy that encourages each expert to update in a direction to the subspace spanned by other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs [61.07130026622437]
Large Language Models (LLMs) still struggle with natural language reasoning tasks. Motivated by the society of minds, we propose ReConcile. A multi-model multi-agent framework designed as a round table conference among diverse LLM agents.
arXiv Detail & Related papers (2023-09-22T17:12:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.