MAgIC: Investigation of Large Language Model Powered Multi-Agent in
Cognition, Adaptability, Rationality and Collaboration
- URL: http://arxiv.org/abs/2311.08562v2
- Date: Thu, 16 Nov 2023 11:40:26 GMT
- Title: MAgIC: Investigation of Large Language Model Powered Multi-Agent in
Cognition, Adaptability, Rationality and Collaboration
- Authors: Lin Xu, Zhiyuan Hu, Daquan Zhou, Hongyu Ren, Zhen Dong, Kurt Keutzer,
See Kiong Ng, Jiashi Feng
- Abstract summary: Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
- Score: 102.41118020705876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have marked a significant advancement in the
field of natural language processing, demonstrating exceptional capabilities in
reasoning, tool usage, and memory. As their applications extend into
multi-agent environments, a need has arisen for a comprehensive evaluation
framework that captures their abilities in reasoning, planning, collaboration,
and more. This work introduces a novel benchmarking framework specifically
tailored to assess LLMs within multi-agent settings, providing quantitative
metrics to evaluate their judgment, reasoning, deception, self-awareness,
cooperation, coordination, and rationality. We utilize games such as Chameleon
and Undercover, alongside game theory scenarios like Cost Sharing, Multi-player
Prisoner's Dilemma, and Public Good, to create diverse testing environments.
Our framework is fortified with the Probabilistic Graphical Modeling (PGM)
method, enhancing the LLMs' capabilities in navigating complex social and
cognitive dimensions. The benchmark evaluates seven multi-agent systems powered
by different LLMs, quantitatively highlighting a significant capability gap
of more than threefold between the strongest, GPT-4, and the weakest, Llama-2-70B. It
also confirms that our PGM enhancement boosts the inherent abilities of all
selected models by 50% on average. Our code is released at
https://github.com/cathyxl/MAgIC.
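As a concrete illustration of the game-theory scenarios named above, the short Python sketch below computes payoffs for a public goods game. It is a minimal, hypothetical example: the function name, endowment, and multiplier are illustrative assumptions and are not taken from the MAgIC implementation.

# Illustrative sketch (not the MAgIC code): payoffs in an n-player public goods game.
# Endowment and multiplier values are hypothetical.
def public_goods_payoffs(contributions, endowment=10.0, multiplier=1.6):
    """Each player keeps what they did not contribute; pooled contributions are
    multiplied and split equally. With 1 < multiplier < n, free-riding is
    individually rational even though full contribution maximizes group welfare."""
    n = len(contributions)
    pool = multiplier * sum(contributions)
    return [endowment - c + pool / n for c in contributions]

# Three agents: two contribute fully, one free-rides.
print(public_goods_payoffs([10.0, 10.0, 0.0]))
# -> [10.67, 10.67, 20.67] (rounded): the free-rider earns most, which is the
# tension a rational LLM agent has to reason about in such environments.

A benchmark of this kind can then score agents by how their contribution decisions trade off individual payoff against group welfare across repeated rounds.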
Related papers
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments [35.926581910260076]
We introduce LLMArena, a framework for evaluating the capabilities of large language models in multi-agent dynamic environments.
LLMArena employs TrueSkill scoring to assess crucial abilities in LLM agents, including spatial reasoning, strategic planning, numerical reasoning, risk assessment, communication, opponent modeling, and team collaboration.
We conduct an extensive experiment and human evaluation among different sizes and types of LLMs, showing that LLMs still have a significant journey ahead in their development towards becoming fully autonomous agents.
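For context on the TrueSkill scoring mentioned above, here is a minimal sketch using the open-source `trueskill` Python package to update ratings from one match ranking. The agent names and the match outcome are hypothetical, and this is not LLMArena's actual evaluation code.

# Minimal TrueSkill rating update with the `trueskill` package (pip install trueskill).
# Agent names and the match outcome are made up for illustration.
import trueskill

env = trueskill.TrueSkill(draw_probability=0.0)
ratings = {name: env.create_rating() for name in ("agent_a", "agent_b", "agent_c")}

# One three-player game: agent_a finished first, agent_c second, agent_b third.
groups = [(ratings["agent_a"],), (ratings["agent_c"],), (ratings["agent_b"],)]
(a,), (c,), (b,) = env.rate(groups, ranks=[0, 1, 2])

for name, r in [("agent_a", a), ("agent_b", b), ("agent_c", c)]:
    # mu is the estimated skill, sigma the remaining uncertainty.
    print(f"{name}: mu={r.mu:.2f}, sigma={r.sigma:.2f}")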
arXiv Detail & Related papers (2024-02-26T11:31:48Z)
- Dynamic Evaluation of Large Language Models by Meta Probing Agents [44.20074234421295]
We propose meta probing agents (MPA) to evaluate large language models (LLMs).
MPA is the key component of DyVal 2, which naturally extends the previous DyVal.
MPA designs the probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory.
arXiv Detail & Related papers (2024-02-21T06:46:34Z)
- Occasionally Secure: A Comparative Analysis of Code Generation Assistants [8.573156248244695]
This paper focuses on identifying and understanding the conditions and contexts in which LLMs can be effectively and safely deployed.
We conducted a comparative analysis of four advanced LLMs (GPT-3.5 and GPT-4 via ChatGPT, and Bard and Gemini from Google) on 9 separate tasks to assess each model's code generation capabilities.
We collected 61 code outputs and analyzed them across several aspects: functionality, security, performance, complexity, and reliability.
arXiv Detail & Related papers (2024-02-01T15:49:47Z)
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
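To picture the multi-turn setup, the toy loop below runs a scripted policy through a tiny text game with a terminal reward. This is a generic sketch, not the LMRL-Gym API: the environment class, its methods, and the reward scheme are hypothetical.

# Toy multi-turn text game, generic illustration only (not the LMRL-Gym interface).
import random

class GuessingGameEnv:
    """The agent must find a hidden number within a fixed turn budget."""
    def __init__(self, low=1, high=15, max_turns=4):
        self.low, self.high, self.max_turns = low, high, max_turns

    def reset(self):
        self.secret = random.randint(self.low, self.high)
        self.turn = 0
        return f"I am thinking of a number between {self.low} and {self.high}."

    def step(self, utterance):
        self.turn += 1
        guess = int(utterance.rsplit(maxsplit=1)[-1])  # expects "... <number>"
        if guess == self.secret:
            return "Correct!", 1.0, True
        hint = "Higher." if guess < self.secret else "Lower."
        return hint, 0.0, self.turn >= self.max_turns

# A scripted binary-search policy stands in for an LLM agent.
env = GuessingGameEnv()
obs, lo, hi = env.reset(), env.low, env.high
done, total_reward = False, 0.0
while not done:
    guess = (lo + hi) // 2
    obs, reward, done = env.step(f"My guess is {guess}")
    total_reward += reward
    if obs == "Higher.":
        lo = guess + 1
    elif obs == "Lower.":
        hi = guess - 1
print("episode return:", total_reward)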
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
- Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models [31.509994889286183]
We introduce Language Agent Tree Search (LATS) -- the first general framework that synergizes the capabilities of language models (LMs) in reasoning, acting, and planning.
A key feature of our approach is the incorporation of an environment for external feedback, which offers a more deliberate and adaptive problem-solving mechanism.
LATS achieves state-of-the-art pass@1 accuracy (92.7%) for programming on HumanEval with GPT-4 and demonstrates gradient-free performance (average score of 75.9) comparable to gradient-based fine-tuning for web navigation on WebShop with GPT
arXiv Detail & Related papers (2023-10-06T17:55:11Z)
- Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation [52.930183136111864]
We propose using scorable negotiation to evaluate Large Language Models (LLMs).
To reach an agreement, agents must have strong arithmetic, inference, exploration, and planning capabilities.
We provide procedures to create new games and increase games' difficulty to have an evolving benchmark.
arXiv Detail & Related papers (2023-09-29T13:33:06Z)
- AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents [0.0]
We present a novel framework for enhancing the capabilities of large language models (LLMs) by leveraging the power of multi-agent systems.
Our framework introduces a collaborative environment where multiple intelligent agent components, each with distinctive attributes and roles, work together to handle complex tasks more efficiently and effectively.
arXiv Detail & Related papers (2023-06-05T23:55:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.