MAgIC: Investigation of Large Language Model Powered Multi-Agent in
Cognition, Adaptability, Rationality and Collaboration
- URL: http://arxiv.org/abs/2311.08562v2
- Date: Thu, 16 Nov 2023 11:40:26 GMT
- Title: MAgIC: Investigation of Large Language Model Powered Multi-Agent in
Cognition, Adaptability, Rationality and Collaboration
- Authors: Lin Xu, Zhiyuan Hu, Daquan Zhou, Hongyu Ren, Zhen Dong, Kurt Keutzer,
See Kiong Ng, Jiashi Feng
- Abstract summary: Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
- Score: 102.41118020705876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have marked a significant advancement in the
field of natural language processing, demonstrating exceptional capabilities in
reasoning, tool usage, and memory. As their applications extend into
multi-agent environments, a need has arisen for a comprehensive evaluation
framework that captures their abilities in reasoning, planning, collaboration,
and more. This work introduces a novel benchmarking framework specifically
tailored to assess LLMs within multi-agent settings, providing quantitative
metrics to evaluate their judgment, reasoning, deception, self-awareness,
cooperation, coordination, and rationality. We utilize games such as Chameleon
and Undercover, alongside game theory scenarios like Cost Sharing, Multi-player
Prisoner's Dilemma, and Public Good, to create diverse testing environments.
Our framework is fortified with the Probabilistic Graphical Modeling (PGM)
method, enhancing the LLMs' capabilities in navigating complex social and
cognitive dimensions. The benchmark evaluates seven multi-agent systems powered
by different LLMs, quantitatively highlighting a capability gap of more than
threefold between the strongest, GPT-4, and the weakest, Llama-2-70B. It also
confirms that our PGM enhancement boosts the inherent abilities of all selected
models by 50% on average. Our code is released at
https://github.com/cathyxl/MAgIC.
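To make the game-theoretic scenarios above concrete, below is a minimal Python sketch of one round of a multi-player social dilemma in the spirit of the Multi-player Prisoner's Dilemma / Public Good settings, plus a toy Bayesian belief update over an opponent's cooperation tendency in the spirit of the PGM enhancement. The payoff constants, the Beta-belief model, and all names are illustrative assumptions, not the benchmark's actual implementation (see https://github.com/cathyxl/MAgIC for the released code).

```python
# Illustrative sketch only: a simplified multi-player social dilemma round and a
# toy Bayesian belief update over an opponent's cooperation tendency.
# Payoff values and the belief model are assumptions for illustration; they are
# not taken from the MAgIC benchmark or its released code.

from dataclasses import dataclass

# Hypothetical payoff scheme: every cooperator pays a fixed cost, and each
# cooperator adds a fixed benefit to a pool shared by all players.
COOPERATION_COST = 4.0
SHARED_BENEFIT_PER_COOPERATOR = 2.0


def round_payoffs(actions: dict[str, str]) -> dict[str, float]:
    """Compute per-player payoffs for one round given actions 'C' or 'D'."""
    n_cooperators = sum(1 for a in actions.values() if a == "C")
    common_pool = n_cooperators * SHARED_BENEFIT_PER_COOPERATOR
    payoffs = {}
    for player, action in actions.items():
        payoff = common_pool  # every player receives the shared benefit
        if action == "C":
            payoff -= COOPERATION_COST  # only cooperators pay the cost
        payoffs[player] = payoff
    return payoffs


@dataclass
class CooperationBelief:
    """Beta-distributed belief about how often an opponent cooperates."""
    alpha: float = 1.0  # pseudo-count of observed cooperations
    beta: float = 1.0   # pseudo-count of observed defections

    def update(self, action: str) -> None:
        if action == "C":
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def p_cooperate(self) -> float:
        return self.alpha / (self.alpha + self.beta)


if __name__ == "__main__":
    actions = {"agent_a": "C", "agent_b": "D", "agent_c": "C"}
    print(round_payoffs(actions))  # the defector free-rides on the cooperators

    belief = CooperationBelief()
    for observed in ["D", "D", "C"]:
        belief.update(observed)
    print(round(belief.p_cooperate, 2))  # posterior mean after 3 observations
```

In this formulation defection strictly dominates cooperation for each individual player, while mutual cooperation yields a higher collective payoff; that tension is what such scenarios use to probe rationality and cooperation.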
Related papers
- A Survey on Large Language Models with some Insights on their Capabilities and Limitations [0.3222802562733786]
Large Language Models (LLMs) exhibit remarkable performance across various language-related tasks.
LLMs have demonstrated emergent abilities extending beyond their core functions.
This paper explores the foundational components, scaling mechanisms, and architectural strategies that drive these capabilities.
arXiv Detail & Related papers (2025-01-03T21:04:49Z)
- MageBench: Bridging Large Multimodal Models to Agents [90.59091431806793]
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents.
Existing benchmarks mostly assess their reasoning abilities on the language side only.
MageBench is a reasoning-capability-oriented multimodal agent benchmark.
arXiv Detail & Related papers (2024-12-05T17:08:19Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments [35.926581910260076]
We introduce LLMArena, a framework for evaluating the capabilities of large language models in multi-agent dynamic environments.
LLMArena employs TrueSkill scoring to assess crucial abilities in LLM agents, including spatial reasoning, strategic planning, numerical reasoning, risk assessment, communication, opponent modeling, and team collaboration.
We conduct extensive experiments and human evaluations across LLMs of different sizes and types, showing that LLMs still have a significant journey ahead before becoming fully autonomous agents.
arXiv Detail & Related papers (2024-02-26T11:31:48Z)
- Dynamic Evaluation of Large Language Models by Meta Probing Agents [44.20074234421295]
We propose meta probing agents (MPA) to evaluate large language models (LLMs).
MPA is the key component of DyVal 2, which naturally extends the previous DyVal (Zhu et al., 2023).
MPA designs probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory.
arXiv Detail & Related papers (2024-02-21T06:46:34Z)
- Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation [52.930183136111864]
We propose using scorable negotiation to evaluate Large Language Models (LLMs).
To reach an agreement, agents must have strong arithmetic, inference, exploration, and planning capabilities.
We provide procedures to create new games and to increase their difficulty, yielding an evolving benchmark.
arXiv Detail & Related papers (2023-09-29T13:33:06Z)
- AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present a Language-Assisted Multi-Modal instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark covering a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)