PillagerBench: Benchmarking LLM-Based Agents in Competitive Minecraft Team Environments
- URL: http://arxiv.org/abs/2509.06235v1
- Date: Sun, 07 Sep 2025 22:51:12 GMT
- Title: PillagerBench: Benchmarking LLM-Based Agents in Competitive Minecraft Team Environments
- Authors: Olivier Schipper, Yudi Zhang, Yali Du, Mykola Pechenizkiy, Meng Fang
- Abstract summary: We introduce PillagerBench, a framework for evaluating multi-agent systems in real-time competitive team-vs-team scenarios in Minecraft. We also propose TactiCrafter, an LLM-based multi-agent system that facilitates teamwork through human-readable tactics. Our evaluation demonstrates that TactiCrafter outperforms baseline approaches and showcases adaptive learning through self-play.
- Score: 48.892997022500765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM-based agents have shown promise in various cooperative and strategic reasoning tasks, but their effectiveness in competitive multi-agent environments remains underexplored. To address this gap, we introduce PillagerBench, a novel framework for evaluating multi-agent systems in real-time competitive team-vs-team scenarios in Minecraft. It provides an extensible API, multi-round testing, and rule-based built-in opponents for fair, reproducible comparisons. We also propose TactiCrafter, an LLM-based multi-agent system that facilitates teamwork through human-readable tactics, learns causal dependencies, and adapts to opponent strategies. Our evaluation demonstrates that TactiCrafter outperforms baseline approaches and showcases adaptive learning through self-play. Additionally, we analyze its learning process and strategic evolution over multiple game episodes. To encourage further research, we have open-sourced PillagerBench, fostering advancements in multi-agent AI for competitive environments.
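The abstract names three framework features: an extensible API, multi-round testing, and rule-based built-in opponents for reproducible comparisons. A minimal Python sketch of what such a multi-round team-vs-team evaluation loop might look like; every class and method name here is hypothetical and does not reflect the actual PillagerBench API:

```python
# Hypothetical sketch of a multi-round team-vs-team evaluation loop in the
# spirit of PillagerBench's stated design (extensible API, multi-round
# testing, rule-based built-in opponents). All names are illustrative.
from dataclasses import dataclass
import random


@dataclass
class MatchResult:
    team_a_score: int
    team_b_score: int


class RuleBasedOpponent:
    """A deterministic built-in opponent, seeded per round for reproducibility."""

    def act(self, round_idx: int) -> int:
        rng = random.Random(round_idx)  # same seed -> same play every run
        return rng.randint(0, 10)


class ScriptedAgentTeam:
    """Stand-in for an LLM-driven team such as TactiCrafter."""

    def act(self, round_idx: int) -> int:
        rng = random.Random(round_idx + 1000)
        return rng.randint(3, 10)  # slightly stronger fixed policy


def run_match(team_a, team_b, n_rounds: int = 5) -> MatchResult:
    """Play n_rounds and tally per-team scores."""
    a = sum(team_a.act(i) for i in range(n_rounds))
    b = sum(team_b.act(i) for i in range(n_rounds))
    return MatchResult(a, b)


result = run_match(ScriptedAgentTeam(), RuleBasedOpponent())
```

Because both teams are seeded deterministically, repeated runs of `run_match` produce identical results, which is the property that makes rule-based built-in opponents useful for fair comparisons across agent systems.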
Related papers
- TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents [5.173133826653683]
We present TowerMind, a novel environment grounded in the tower defense subgenre of RTS games.
We design five benchmark levels to evaluate several widely used Large Language Models.
Results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions.
arXiv Detail & Related papers (2026-01-09T16:18:08Z)
- LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition [104.81487689011341]
We introduce LM Fight Arena, a novel framework that evaluates large multimodal models in Mortal Kombat II.
Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM's strategic reasoning capabilities.
arXiv Detail & Related papers (2025-10-10T02:19:21Z)
- Who is a Better Player: LLM against LLM [53.46608216197315]
We propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board game competitions.
We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players.
arXiv Detail & Related papers (2025-08-05T06:41:47Z)
- Agents of Change: Self-Evolving LLM Agents for Strategic Planning [17.67637003848376]
We benchmark a progression of LLM-based agents, from a simple game-playing agent to systems capable of autonomously rewriting their own prompts and their player agent's code.
Our results show that self-evolving agents, particularly when powered by models like Claude 3.7 and GPT-4o, outperform static baselines by autonomously adapting their strategies.
arXiv Detail & Related papers (2025-06-05T05:45:24Z)
- FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory [51.96049148869987]
We present FAIRGAME, a Framework for AI Agents Bias Recognition using Game Theory.
We describe its implementation and usage, and we employ it to uncover biased outcomes in popular games among AI agents.
Overall, FAIRGAME allows users to reliably and easily simulate their desired games and scenarios.
arXiv Detail & Related papers (2025-04-19T15:29:04Z)
- AVA: Attentive VLM Agent for Mastering StarCraft II [56.07921367623274]
We introduce Attentive VLM Agent (AVA), a multimodal StarCraft II agent that aligns artificial agent perception with the human gameplay experience.
Our agent addresses this limitation by incorporating RGB visual inputs and natural language observations that more closely simulate human cognitive processes during gameplay.
arXiv Detail & Related papers (2025-03-07T12:54:25Z)
- Preference-based opponent shaping in differentiable games [3.373994463906893]
We propose a novel Preference-based Opponent Shaping (PBOS) method to enhance the strategy learning process by shaping agents' preferences towards cooperation.
We verify the performance of the PBOS algorithm in a variety of differentiable games.
arXiv Detail & Related papers (2024-12-04T06:49:21Z)
- Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information [36.11862095329315]
Large language models (LLMs) have shown success in handling simple games with imperfect information.
This study investigates the applicability of knowledge acquired by open-source and API-based LLMs to sophisticated text-based games.
arXiv Detail & Related papers (2024-08-05T15:36:46Z)
- FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning [25.857375787748715]
We present FightLadder, a real-time fighting game platform, to empower competitive MARL research.
We provide implementations of state-of-the-art MARL algorithms for competitive games, as well as a set of evaluation metrics.
We demonstrate the feasibility of this platform by training a general agent that consistently defeats 12 built-in characters in single-player mode.
arXiv Detail & Related papers (2024-06-04T08:04:23Z)
- ALYMPICS: LLM Agents Meet Game Theory -- Exploring Strategic Decision-Making with AI Agents [77.34720446306419]
Alympics is a systematic simulation framework utilizing Large Language Model (LLM) agents for game theory research.
Alympics creates a versatile platform for studying complex game theory problems.
arXiv Detail & Related papers (2023-11-06T16:03:46Z)
- LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay [55.12945794835791]
Using Avalon as a testbed, we employ system prompts to guide LLM agents in gameplay.
We propose a novel framework tailored for Avalon that features a multi-agent system facilitating efficient communication and interaction.
Results affirm the framework's effectiveness in creating adaptive agents and suggest LLM-based agents' potential in navigating dynamic social interactions.
arXiv Detail & Related papers (2023-10-23T14:35:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.