PillagerBench: Benchmarking LLM-Based Agents in Competitive Minecraft Team Environments
- URL: http://arxiv.org/abs/2509.06235v1
- Date: Sun, 07 Sep 2025 22:51:12 GMT
- Title: PillagerBench: Benchmarking LLM-Based Agents in Competitive Minecraft Team Environments
- Authors: Olivier Schipper, Yudi Zhang, Yali Du, Mykola Pechenizkiy, Meng Fang
- Abstract summary: We introduce PillagerBench, a framework for evaluating multi-agent systems in real-time competitive team-vs-team scenarios in Minecraft. We also propose TactiCrafter, an LLM-based multi-agent system that facilitates teamwork through human-readable tactics. Our evaluation demonstrates that TactiCrafter outperforms baseline approaches and showcases adaptive learning through self-play.
- Score: 48.892997022500765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM-based agents have shown promise in various cooperative and strategic reasoning tasks, but their effectiveness in competitive multi-agent environments remains underexplored. To address this gap, we introduce PillagerBench, a novel framework for evaluating multi-agent systems in real-time competitive team-vs-team scenarios in Minecraft. It provides an extensible API, multi-round testing, and rule-based built-in opponents for fair, reproducible comparisons. We also propose TactiCrafter, an LLM-based multi-agent system that facilitates teamwork through human-readable tactics, learns causal dependencies, and adapts to opponent strategies. Our evaluation demonstrates that TactiCrafter outperforms baseline approaches and showcases adaptive learning through self-play. Additionally, we analyze its learning process and strategic evolution over multiple game episodes. To encourage further research, we have open-sourced PillagerBench, fostering advancements in multi-agent AI for competitive environments.
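The abstract names three framework features: an extensible API, multi-round testing, and rule-based built-in opponents for reproducible comparisons. A minimal Python sketch of what such a multi-round team-vs-team evaluation loop might look like; every class and method name here is hypothetical and does not reflect the actual PillagerBench API:

```python
# Hypothetical sketch of a multi-round team-vs-team evaluation loop in the
# spirit of PillagerBench's stated design (extensible API, multi-round
# testing, rule-based built-in opponents). All names are illustrative.
from dataclasses import dataclass
import random


@dataclass
class MatchResult:
    team_a_score: int
    team_b_score: int


class RuleBasedOpponent:
    """A deterministic built-in opponent, seeded per round for reproducibility."""

    def act(self, round_idx: int) -> int:
        rng = random.Random(round_idx)  # same seed -> same play every run
        return rng.randint(0, 10)


class ScriptedAgentTeam:
    """Stand-in for an LLM-driven team such as TactiCrafter."""

    def act(self, round_idx: int) -> int:
        rng = random.Random(round_idx + 1000)
        return rng.randint(3, 10)  # slightly stronger fixed policy


def run_match(team_a, team_b, n_rounds: int = 5) -> MatchResult:
    """Play n_rounds and tally per-team scores."""
    a = sum(team_a.act(i) for i in range(n_rounds))
    b = sum(team_b.act(i) for i in range(n_rounds))
    return MatchResult(a, b)


result = run_match(ScriptedAgentTeam(), RuleBasedOpponent())
```

Because both teams are seeded deterministically, repeated runs of `run_match` produce identical results, which is the property that makes rule-based built-in opponents useful for fair comparisons across agent systems.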
Related papers
- TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents [5.173133826653683]
We present TowerMind, a novel environment grounded in the tower defense subgenre of RTS games.
We design five benchmark levels to evaluate several widely used Large Language Models.
Results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions.
arXiv Detail & Related papers (2026-01-09T16:18:08Z)
- LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition [104.81487689011341]
We introduce LM Fight Arena, a novel framework that evaluates large multimodal models in Mortal Kombat II.
Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM's strategic reasoning capabilities.
arXiv Detail & Related papers (2025-10-10T02:19:21Z)
- Who is a Better Player: LLM against LLM [53.46608216197315]
We propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board game competitions.
We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players.
arXiv Detail & Related papers (2025-08-05T06:41:47Z)
- Agents of Change: Self-Evolving LLM Agents for Strategic Planning [17.67637003848376]
We benchmark a progression of LLM-based agents, from a simple game-playing agent to systems capable of autonomously rewriting their own prompts and their player agent's code.
Our results show that self-evolving agents, particularly when powered by models like Claude 3.7 and GPT-4o, outperform static baselines by autonomously adapting their strategies.
arXiv Detail & Related papers (2025-06-05T05:45:24Z)
- FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory [51.96049148869987]
We present FAIRGAME, a Framework for AI Agents Bias Recognition using Game Theory.
We describe its implementation and usage, and we employ it to uncover biased outcomes in popular games among AI agents.
Overall, FAIRGAME allows users to reliably and easily simulate their desired games and scenarios.
arXiv Detail & Related papers (2025-04-19T15:29:04Z)
- AVA: Attentive VLM Agent for Mastering StarCraft II [56.07921367623274]
We introduce Attentive VLM Agent (AVA), a multimodal StarCraft II agent that aligns artificial agent perception with the human gameplay experience.
Our agent addresses this limitation by incorporating RGB visual inputs and natural language observations that more closely simulate human cognitive processes during gameplay.
arXiv Detail & Related papers (2025-03-07T12:54:25Z)
- Preference-based opponent shaping in differentiable games [3.373994463906893]
We propose a novel Preference-based Opponent Shaping (PBOS) method to enhance the strategy learning process by shaping agents' preferences towards cooperation.
We verify the performance of the PBOS algorithm in a variety of differentiable games.
arXiv Detail & Related papers (2024-12-04T06:49:21Z)
- Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information [36.11862095329315]
Large language models (LLMs) have shown success in handling simple games with imperfect information.
This study investigates the applicability of knowledge acquired by open-source and API-based LLMs to sophisticated text-based games.
arXiv Detail & Related papers (2024-08-05T15:36:46Z)
- FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning [25.857375787748715]
We present FightLadder, a real-time fighting game platform, to empower competitive MARL research.
We provide implementations of state-of-the-art MARL algorithms for competitive games, as well as a set of evaluation metrics.
We demonstrate the feasibility of this platform by training a general agent that consistently defeats 12 built-in characters in single-player mode.
arXiv Detail & Related papers (2024-06-04T08:04:23Z)
- ALYMPICS: LLM Agents Meet Game Theory -- Exploring Strategic Decision-Making with AI Agents [77.34720446306419]
Alympics is a systematic simulation framework utilizing Large Language Model (LLM) agents for game theory research.
Alympics creates a versatile platform for studying complex game theory problems.
arXiv Detail & Related papers (2023-11-06T16:03:46Z)
- LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay [55.12945794835791]
Using Avalon as a testbed, we employ system prompts to guide LLM agents in gameplay.
We propose a novel framework tailored for Avalon that features a multi-agent system facilitating efficient communication and interaction.
Results affirm the framework's effectiveness in creating adaptive agents and suggest LLM-based agents' potential in navigating dynamic social interactions.
arXiv Detail & Related papers (2023-10-23T14:35:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.