Related papers: BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

URL: http://arxiv.org/abs/2408.15971v1
Date: Wed, 28 Aug 2024 17:43:55 GMT
Title: BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
Authors: Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang,
Abstract summary: Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks. Compared to single agents, multi-agent systems have higher requirements for the collaboration capabilities of language models. We propose a benchmark, called BattleAgentBench, which defines seven sub-stages of three varying difficulty levels.
Score: 15.159418172629701
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems have higher requirements for the collaboration capabilities of language models. Many benchmarks are proposed to evaluate their collaborative abilities. However, these benchmarks lack fine-grained evaluations of LLM collaborative capabilities. Additionally, multi-agent collaborative and competitive scenarios are ignored in existing works. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages of three varying difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition capabilities. We conducted extensive evaluations on leading four closed-source and seven open-source models. Experimental results indicate that API-based models perform excellently on simple tasks but open-source small models struggle with simple tasks. Regarding difficult tasks that require collaborative and competitive abilities, although API-based models have demonstrated some collaborative capabilities, there is still enormous room for improvement.

Related papers

Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning [29.580108004844856]
Multi-agent systems (MAS) built on large language models (LLMs) offer a promising path toward solving complex, real-world tasks. Recent advancements in test-time scaling (TTS) have significantly improved single-agent performance on challenging reasoning tasks. We introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination.
arXiv Detail & Related papers (2025-04-14T00:27:45Z)
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents [59.825725526176655]
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents. Existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. We introduce MultiAgentBench, a benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios.
arXiv Detail & Related papers (2025-03-03T05:18:50Z)
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration [51.452664740963066]
Collaborative Gym is a framework enabling asynchronous, tripartite interaction among agents, humans, and task environments. We instantiate Co-Gym with three representative tasks in both simulated and real-world conditions. Our findings reveal that collaborative agents consistently outperform their fully autonomous counterparts in task performance.
arXiv Detail & Related papers (2024-12-20T09:21:15Z)
Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications [15.480315462362531]
This report presents a comprehensive evaluation of coordination and routing capabilities in a novel multi-agent collaboration framework. For coordination capabilities, we demonstrate the effectiveness of inter-agent communication and payload referencing mechanisms, achieving end-to-end goal success rates of 90%. Our analysis yields several key findings: multi-agent collaboration enhances goal success rates by up to 70% compared to single-agent approaches in our benchmarks.
arXiv Detail & Related papers (2024-12-06T22:14:17Z)
COMMA: A Communicative Multimodal Multi-Agent Benchmark [7.831385481814481]
We introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. By testing both agent-agent and agent-human collaborations using open-source and closed-source models, our findings reveal surprising weaknesses in state-of-the-art models.
arXiv Detail & Related papers (2024-10-10T02:49:47Z)
GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI [64.57616646552869]
This paper explores collaborative AI systems that use to enhance performance to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex, offering greater flexibility and scalability compared to monolithic models. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
Scaling Large-Language-Model-based Multi-Agent Collaboration [75.5241464256688]
Pioneering advancements in large language model-powered agents have underscored the design pattern of multi-agent collaboration. Inspired by the neural scaling law, this study investigates whether a similar principle applies to increasing agents in multi-agent collaboration.
arXiv Detail & Related papers (2024-06-11T11:02:04Z)
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing. As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework. This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
Multi-Agent Consensus Seeking via Large Language Models [6.922356864800498]
Multi-agent systems driven by large language models (LLMs) have shown promising abilities for solving complex tasks in a collaborative manner. This work considers a fundamental problem in multi-agent collaboration: consensus seeking.
arXiv Detail & Related papers (2023-10-31T03:37:11Z)
Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation [52.930183136111864]
We propose using scorable negotiation to evaluate Large Language Models (LLMs) To reach an agreement, agents must have strong arithmetic, inference, exploration, and planning capabilities. We provide procedures to create new games and increase games' difficulty to have an evolving benchmark.
arXiv Detail & Related papers (2023-09-29T13:33:06Z)
ProAgent: Building Proactive Cooperative Agents with Large Language Models [89.53040828210945]
ProAgent is a novel framework that harnesses large language models to create proactive agents. ProAgent can analyze the present state, and infer the intentions of teammates from observations. ProAgent exhibits a high degree of modularity and interpretability, making it easily integrated into various coordination scenarios.
arXiv Detail & Related papers (2023-08-22T10:36:56Z)
AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors [93.38830440346783]
We propose a multi-agent framework framework that can collaboratively adjust its composition as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that framework framework can effectively deploy multi-agent groups that outperform a single agent. In view of these behaviors, we discuss some possible strategies to leverage positive ones and mitigate negative ones for improving the collaborative potential of multi-agent groups.
arXiv Detail & Related papers (2023-08-21T16:47:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.