Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate
- URL: http://arxiv.org/abs/2305.11595v3
- Date: Wed, 18 Oct 2023 06:32:15 GMT
- Title: Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate
- Authors: Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu and Bing Qin
- Abstract summary: Large Language Models (LLMs) have shown impressive capabilities in various applications, but they still face various inconsistency issues.
To examine whether LLMs can collaborate effectively to achieve a consensus for a shared goal, we focus on commonsense reasoning.
Our work contributes to understanding the inter-consistency among LLMs and lays the foundation for developing future collaboration methods.
- Score: 41.949869545423375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown impressive capabilities in various
applications, but they still face various inconsistency issues. Existing works
primarily focus on the inconsistency issues within a single LLM, while we
complementarily explore the inter-consistency among multiple LLMs for
collaboration. To examine whether LLMs can collaborate effectively to achieve a
consensus for a shared goal, we focus on commonsense reasoning, and introduce a
formal debate framework (FORD) to conduct a three-stage debate among LLMs,
aligned with real-world scenarios: fair debate, mismatched debate, and
roundtable debate. Extensive experiments on various datasets show that LLMs
can effectively collaborate to reach a consensus despite noticeable
inter-inconsistencies, but imbalances in their abilities can lead to
domination by superior LLMs.
Leveraging a more advanced LLM like GPT-4 as an authoritative judge can boost
collaboration performance. Our work contributes to understanding the
inter-consistency among LLMs and lays the foundation for developing future
collaboration methods. Codes and data are available at
https://github.com/Waste-Wood/FORD
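The abstract describes FORD's protocol only at a high level. As a rough illustration, here is a minimal Python sketch of the fair-debate setting, assuming a generic chat(model, prompt) helper; the helper, prompts, and model roles are hypothetical placeholders, not the API of the FORD repository.

# Hypothetical sketch of a FORD-style "fair debate": two debater LLMs of
# comparable ability alternate arguments over a shared transcript, then a
# judge model reads the debate and issues the consensus answer.
# chat() and the model names are placeholders, not the FORD repository's API.
from typing import Callable

ChatFn = Callable[[str, str], str]  # (model_name, prompt) -> reply text

def fair_debate(question: str, debater_a: str, debater_b: str,
                judge: str, chat: ChatFn, rounds: int = 3) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        # Each debater sees the full transcript and defends its stance.
        for model in (debater_a, debater_b):
            prompt = ("\n".join(transcript)
                      + "\nState your answer and rebut the other debater.")
            transcript.append(f"{model}: {chat(model, prompt)}")
    # Per the paper's finding, a stronger model (e.g. GPT-4) works well
    # as an authoritative judge that produces the final consensus.
    return chat(judge, "\n".join(transcript)
                + "\nAs the judge, give the single consensus answer.")

Under the same assumptions, a mismatched debate would simply pair debaters of unequal ability, and a roundtable debate would extend the inner loop to more than two debaters.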
Related papers
- CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z)
- Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models [32.336273322481276]
Despite their diverse capabilities, Large Language Models (LLMs) exhibit varying strengths and weaknesses.
To address these challenges, recent studies have explored collaborative strategies for LLMs.
This paper provides a comprehensive overview of this emerging research area, highlighting the motivation behind such collaborations.
arXiv Detail & Related papers (2024-07-08T16:29:08Z)
- LLM Discussion: Enhancing the Creativity of Large Language Models via Discussion Framework and Role-Play [43.55248812883912]
Large language models (LLMs) have shown exceptional proficiency in natural language processing but often fall short of generating creative and original responses to open-ended questions.
We propose LLM Discussion, a three-phase discussion framework that facilitates vigorous and diverging idea exchanges.
We evaluate the efficacy of the proposed framework with the Alternative Uses Test, Similarities Test, Instances Test, and Scientific Creativity Test.
arXiv Detail & Related papers (2024-05-10T10:19:14Z)
- Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key? [84.36332588191623]
We propose a novel group discussion framework to enrich the set of discussion mechanisms.
We observe that the multi-agent discussion performs better than a single agent only when there is no demonstration in the prompt.
arXiv Detail & Related papers (2024-02-28T12:04:05Z)
- Theory of Mind for Multi-Agent Collaboration via Large Language Models [5.2767999863286645]
This study evaluates Large Language Models (LLMs)-based agents in a multi-agent cooperative text game with Theory of Mind (ToM) inference tasks.
We observed evidence of emergent collaborative behaviors and high-order Theory of Mind capabilities among LLM-based agents.
arXiv Detail & Related papers (2023-10-16T07:51:19Z)
- LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models [23.092480882456048]
This study aims at a detailed analysis of Large Language Models (LLMs) within the context of Pure Coordination Games.
Our findings indicate that LLM agents equipped with GPT-4-turbo achieve comparable performance to state-of-the-art reinforcement learning methods.
Results on Coordination QA show a large room for improvement in the Theory of Mind reasoning and joint planning abilities of LLMs.
arXiv Detail & Related papers (2023-10-05T21:18:15Z)
- Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View [60.80731090755224]
This paper probes the collaboration mechanisms among contemporary NLP systems through practical experiments combined with theoretical insights.
We fabricate four unique "societies" composed of LLM agents, where each agent is characterized by a specific "trait" (easy-going or overconfident) and collaborates with a distinct "thinking pattern" (debate or reflection).
Our results further illustrate that LLM agents manifest human-like social behaviors, such as conformity and consensus reaching, mirroring social psychology theories.
arXiv Detail & Related papers (2023-10-03T15:05:52Z)
- Corex: Pushing the Boundaries of Complex Reasoning through Multi-Model Collaboration [83.4031923134958]
Corex is a suite of novel general-purpose strategies that transform Large Language Models into autonomous agents.
Inspired by human behaviors, Corex is constituted by diverse collaboration paradigms including Debate, Review, and Retrieve modes.
We demonstrate that orchestrating multiple LLMs to work in concert yields substantially better performance compared to existing methods.
arXiv Detail & Related papers (2023-09-30T07:11:39Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in a "tit for tat" exchange while a judge manages the debate process to obtain a final solution (a minimal hypothetical sketch of such a loop follows this list).
Our framework encourages divergent thinking in LLMs, which is helpful for tasks that require deep contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
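As referenced in the MAD entry above, here is a minimal hypothetical sketch of a judge-managed "tit for tat" debate loop; the chat() interface and prompt wording are assumptions in the same spirit as the earlier sketch, not the MAD authors' implementation.

# Hypothetical sketch of a MAD-style loop: agents argue in a "tit for tat"
# exchange while a judge monitors each round and ends the debate once it
# can extract a final solution. chat() is the same placeholder interface
# as in the earlier sketch.
from typing import Callable, List

ChatFn = Callable[[str, str], str]  # (model_name, prompt) -> reply text

def mad_debate(question: str, agents: List[str], judge: str,
               chat: ChatFn, max_rounds: int = 5) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(max_rounds):
        for agent in agents:
            prompt = ("\n".join(transcript)
                      + "\nArgue your position and counter the other agents.")
            transcript.append(f"{agent}: {chat(agent, prompt)}")
        # The judge manages the process: it either settles the debate or
        # requests another round of arguments.
        ruling = chat(judge, "\n".join(transcript)
                      + "\nReply 'FINAL: <answer>' if settled, else 'CONTINUE'.")
        if ruling.startswith("FINAL:"):
            return ruling[len("FINAL:"):].strip()
    # No consensus within the round budget: fall back to the judge's answer.
    return chat(judge, "\n".join(transcript) + "\nGive your best final answer.")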