MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
- URL: http://arxiv.org/abs/2503.01935v1
- Date: Mon, 03 Mar 2025 05:18:50 GMT
- Title: MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
- Authors: Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You
- Abstract summary: Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents. Existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. We introduce MultiAgentBench, a benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios.
- Score: 59.825725526176655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
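To make the abstract's two key ideas more concrete, here is a minimal sketch (hypothetical code, not the MARBLE API) of how the named coordination topologies (star, chain, tree, graph) could be represented as communication edges among agents, and how a milestone-based KPI might be scored as the fraction of task milestones achieved. All function and role names below are illustrative assumptions, not identifiers from the paper or repository.

```python
# Hypothetical sketch: coordination topologies as communication edges,
# plus a simple milestone-based KPI. Names are illustrative, not from MARBLE.
from itertools import combinations

def topology_edges(agents, kind, custom_edges=None):
    """Return undirected communication edges for a coordination topology."""
    if kind == "star":    # the first agent acts as a central coordinator
        return [(agents[0], a) for a in agents[1:]]
    if kind == "chain":   # messages pass sequentially down the line
        return list(zip(agents, agents[1:]))
    if kind == "tree":    # simple binary tree over the agent list
        return [(agents[(i - 1) // 2], agents[i]) for i in range(1, len(agents))]
    if kind == "graph":   # arbitrary graph; fully connected if no edges given
        return custom_edges or list(combinations(agents, 2))
    raise ValueError(f"unknown topology: {kind}")

def milestone_kpi(milestones, achieved):
    """Milestone-based KPI: fraction of task milestones the agents completed."""
    return sum(m in achieved for m in milestones) / len(milestones)

if __name__ == "__main__":
    agents = ["planner", "coder", "reviewer", "tester"]  # illustrative roles
    print(topology_edges(agents, "star"))
    print(milestone_kpi(["outline", "draft", "experiments", "report"],
                        {"outline", "draft", "experiments"}))  # -> 0.75
```

Under this reading, "graph" is simply the unconstrained case of which agents may exchange messages, while the milestone KPI rewards partial progress rather than only end-to-end task completion.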
Related papers
- Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align Large Language Models.
Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.
We propose a mixture of agent-based decoding strategies leveraging existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z) - LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning [12.996741471128539]
Developing intelligent agents for long-term cooperation in dynamic open-world scenarios is a major challenge in multi-agent systems.
We propose Decentralized Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in a novel Multi-agent Crafter environment.
Our generative agents, powered by Large Language Models (LLMs), are more scalable than traditional MARL agents by leveraging external knowledge and language for long-term planning and reasoning.
arXiv Detail & Related papers (2025-02-08T05:26:02Z) - Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications [15.480315462362531]
This report presents a comprehensive evaluation of coordination and routing capabilities in a novel multi-agent collaboration framework. For coordination capabilities, we demonstrate the effectiveness of inter-agent communication and payload referencing mechanisms, achieving end-to-end goal success rates of 90%. Our analysis yields several key findings: multi-agent collaboration enhances goal success rates by up to 70% compared to single-agent approaches in our benchmarks.
arXiv Detail & Related papers (2024-12-06T22:14:17Z) - COMMA: A Communicative Multimodal Multi-Agent Benchmark [7.831385481814481]
We introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our findings reveal surprising weaknesses in state-of-the-art models, including proprietary models like GPT-4o.
arXiv Detail & Related papers (2024-10-10T02:49:47Z) - BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems [15.159418172629701]
Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks.
Compared to single agents, multi-agent systems have higher requirements for the collaboration capabilities of language models.
We propose a benchmark, called BattleAgentBench, which defines seven sub-stages across three difficulty levels.
arXiv Detail & Related papers (2024-08-28T17:43:55Z) - Efficient Adaptation in Mixed-Motive Environments via Hierarchical Opponent Modeling and Planning [51.52387511006586]
We propose Hierarchical Opponent modeling and Planning (HOP), a novel multi-agent decision-making algorithm.
HOP is hierarchically composed of two modules: an opponent modeling module that infers others' goals and learns corresponding goal-conditioned policies, and a planning module.
HOP exhibits superior few-shot adaptation capabilities when interacting with various unseen agents, and excels in self-play scenarios.
arXiv Detail & Related papers (2024-06-12T08:48:06Z) - Scaling Large-Language-Model-based Multi-Agent Collaboration [72.8998796426346]
Recent breakthroughs in large language model-driven autonomous agents have revealed that multi-agent collaboration often surpasses individual agents through collective reasoning. This study explores whether the continuous addition of collaborative agents can yield similar benefits.
arXiv Detail & Related papers (2024-06-11T11:02:04Z) - MASP: Scalable GNN-based Planning for Multi-Agent Navigation [18.70078556851899]
Multi-Agent Scalable Graph-based Planner (MASP) is a goal-conditioned hierarchical planner for navigation tasks. MASP employs a hierarchical framework to reduce space complexity by decomposing a large exploration space into multiple goal-conditioned subspaces. For agent cooperation and adaptation to varying team sizes, we model agents and goals as graphs to better capture their relationships.
arXiv Detail & Related papers (2023-12-05T06:05:04Z) - MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [98.18244218156492]
Large Language Models (LLMs) have significantly advanced natural language processing. As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework. This work introduces a novel competition-based benchmark framework to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z) - Multi-agent Deep Covering Skill Discovery [50.812414209206054]
We propose Multi-agent Deep Covering Option Discovery, which constructs the multi-agent options through minimizing the expected cover time of the multiple agents' joint state space.
Also, we propose a novel framework to adopt the multi-agent options in the MARL process.
We show that the proposed algorithm can effectively capture agent interactions with the attention mechanism, successfully identify multi-agent options, and significantly outperform prior works using single-agent options or no options.
arXiv Detail & Related papers (2022-10-07T00:40:59Z) - Policy Diagnosis via Measuring Role Diversity in Cooperative Multi-agent RL [107.58821842920393]
We quantify the agents' behavior differences and relate them to policy performance via Role Diversity.
We find that the error bound in MARL can be decomposed into three parts that have a strong relation to the role diversity.
The decomposed factors can significantly impact policy optimization in three popular directions.
arXiv Detail & Related papers (2022-06-01T04:58:52Z)