Related papers: PokéChamp: an Expert-level Minimax Language Agent

URL: http://arxiv.org/abs/2503.04094v1
Date: Thu, 06 Mar 2025 05:06:27 GMT
Title: PokéChamp: an Expert-level Minimax Language Agent
Authors: Seth Karten, Andy Luu Nguyen, Chi Jin
Abstract summary: We introduce PokéChamp, a minimax agent powered by Large Language Models (LLMs) for Pokémon battles. Built on a general framework for two-player competitive games, PokéChamp leverages the generalist capabilities of LLMs to enhance minimax tree search. This work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches.
Score: 17.007111119414745
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce PokéChamp, a minimax agent powered by Large Language Models (LLMs) for Pokémon battles. Built on a general framework for two-player competitive games, PokéChamp leverages the generalist capabilities of LLMs to enhance minimax tree search. Specifically, LLMs replace three key modules: (1) player action sampling, (2) opponent modeling, and (3) value function estimation, enabling the agent to effectively utilize gameplay history and human knowledge to reduce the search space and address partial observability. Notably, our framework requires no additional LLM training. We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76% against the best existing LLM-based bot and 84% against the strongest rule-based bot, demonstrating its superior performance. Even with an open-source 8-billion-parameter Llama 3.1 model, PokéChamp consistently outperforms the previous best LLM-based bot, Pokéllmon powered by GPT-4o, with a 64% win rate. PokéChamp attains a projected Elo of 1300-1500 on the Pokémon Showdown online ladder, placing it among the top 30%-10% of human players. In addition, this work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks and puzzles to evaluate specific battling skills. We further provide key updates to the local game engine. We hope this work fosters further research that leverages Pokémon battles as a benchmark to integrate LLM technologies with game-theoretic algorithms addressing general multiagent problems. Videos, code, and dataset available at https://sites.google.com/view/pokechamp-llm.
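
The three modules named in the abstract (player action sampling, opponent modeling, value function estimation) can be pictured as pluggable callables inside a depth-limited minimax search. The sketch below is illustrative only and is not the authors' code: every name in it (`minimax`, `propose_actions`, `propose_replies`, `estimate_value`, `step`, `is_terminal`) is hypothetical, the toy game is invented for the example, and treating each simultaneous turn as a max over the player's actions of a min over the opponent's replies is an assumption. In an LLM-backed agent the three callables would presumably wrap model prompts; here they are plain functions so the block runs on its own.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

S = TypeVar("S")  # game state
A = TypeVar("A")  # action


def minimax(
    state: S,
    depth: int,
    propose_actions: Callable[[S], Sequence[A]],   # (1) player action sampling
    propose_replies: Callable[[S], Sequence[A]],   # (2) opponent modeling
    estimate_value: Callable[[S], float],          # (3) value function estimation
    step: Callable[[S, A, A], S],                  # hypothetical game engine
    is_terminal: Callable[[S], bool],
) -> Tuple[float, Optional[A]]:
    """Return (value, best action) for the player at `state`."""
    if depth == 0 or is_terminal(state):
        return estimate_value(state), None

    actions = list(propose_actions(state))
    replies = list(propose_replies(state))
    if not actions or not replies:
        # Nothing to search over: fall back to the value estimate.
        return estimate_value(state), None

    best_value, best_action = float("-inf"), None
    for action in actions:
        # Assume the opponent picks the reply that is worst for the player.
        worst = float("inf")
        for reply in replies:
            value, _ = minimax(
                step(state, action, reply), depth - 1,
                propose_actions, propose_replies,
                estimate_value, step, is_terminal,
            )
            worst = min(worst, value)
        if worst > best_value:
            best_value, best_action = worst, action
    return best_value, best_action


# Toy game (invented): the state is a score margin, each side adds points.
toy = dict(
    propose_actions=lambda s: [1, 2],
    propose_replies=lambda s: [0, 3],
    estimate_value=lambda s: float(s),
    step=lambda s, a, b: s + a - b,
    is_terminal=lambda s: False,
)

value, action = minimax(0, 1, **toy)
deep_value, deep_action = minimax(0, 2, **toy)
print(value, action, deep_value, deep_action)
```

Restricting `propose_actions` and `propose_replies` to a handful of candidates is what keeps the tree small in this sketch; the abstract attributes the reduced search space to the LLM modules, but the exact mechanism is not specified here.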

Related papers

NitroGen: An Open Foundation Model for Generalist Gaming Agents [101.41866522979548]
NitroGen is a vision-action foundation model for generalist gaming agents. It is trained on 40,000 hours of gameplay videos across more than 1,000 games.
arXiv Detail & Related papers (2026-01-04T16:24:50Z)
Large Language Models as Pokémon Battle Agents: Strategic Play and Content Generation [4.782714372521615]
Pokémon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment. This work examines whether Large Language Models (LLMs) can serve as competent battle agents. We developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic.
arXiv Detail & Related papers (2025-12-19T07:46:29Z)
Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities [17.019600215402704]
We propose game-based evaluations to holistically assess capabilities. Games require multiple abilities for players to win, are inherently competitive, and are governed by fixed, objective rules. We manifest this evaluation specifically through Dixit, a fantasy card game.
arXiv Detail & Related papers (2025-10-22T17:21:16Z)
PokéAI: A Goal-Generating, Battle-Optimizing Multi-agent System for Pokemon Red [4.558478169296784]
We introduce PokéAI, the first text-based, multi-agent large language model (LLM) framework designed to autonomously play and progress through Pokémon Red. Our system consists of three specialized agents (Planning, Execution, and Critique), each with its own memory bank, role, and skill set.
arXiv Detail & Related papers (2025-06-30T10:09:13Z)
lmgame-Bench: How Good are LLMs at Playing Games? [60.01834131847881]
We study the major challenges in using popular video games to evaluate modern large language model (LLM) agents. We introduce lmgame-Bench to turn games into reliable evaluations.
arXiv Detail & Related papers (2025-05-21T06:02:55Z)
Human-Level Competitive Pokémon via Scalable Offline Reinforcement Learning with Transformers [24.201490513370523]
Competitive Pokémon Singles (CPS) is a popular strategy game where players learn to exploit their opponent based on imperfect information. We develop a pipeline to reconstruct the first-person perspective of an agent from logs saved from the third-person perspective of a spectator. This dataset enables a black-box approach where we train large sequence models to adapt to their opponent based solely on their input trajectory.
arXiv Detail & Related papers (2025-04-06T07:35:15Z)
Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z)
Game Development as Human-LLM Interaction [55.03293214439741]
This paper introduces the Chat Game Engine (ChatGE) powered by Human-LLM interaction. ChatGE allows everyone to develop a custom game using natural language through Human-LLM interaction. We construct a ChatGE for poker games as a case study and evaluate it from two perspectives: interaction quality and code correctness.
arXiv Detail & Related papers (2024-08-18T07:06:57Z)
Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard [0.0]
We introduce a novel benchmark for large language models (LLMs) through grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku. The open-source game simulation code available on GitHub allows LLMs to compete and generates detailed data files. We present the results of games among leading LLMs, including Claude 3.5 Sonnet and Claude 3 Sonnet by Anthropic, Gemini 1.5 Pro and Gemini Flash by Google, GPT-4 Turbo and GPT-4o by OpenAI, and Llama3-70B by Meta.
arXiv Detail & Related papers (2024-07-10T16:14:34Z)
How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [83.78240828340681]
GAMA($\gamma$)-Bench is a new framework for evaluating Large Language Models' Gaming Ability in Multi-Agent environments. $\gamma$-Bench includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to assess LLMs' performance. Our results indicate GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought.
arXiv Detail & Related papers (2024-03-18T14:04:47Z)
PokeLLMon: A Human-Parity Agent for Pokemon Battles with Large Language Models [7.653580388741887]
We introduce PokeLLMon, the first LLM-embodied agent that achieves human-parity performance in tactical battle games. We show that online battles against humans demonstrate PokeLLMon's human-like battle strategies and just-in-time decision making.
arXiv Detail & Related papers (2024-02-02T03:22:12Z)
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z)
On Efficient Reinforcement Learning for Full-length Game of StarCraft II [21.768578136029987]
We investigate a hierarchical RL approach involving extracted macro-actions and a hierarchical architecture of neural networks. On a 64x64 map and using restrictive units, we achieve a win rate of 99% against the level-1 built-in AI. We improve our architecture to train the agent against the cheating-level AIs and achieve win rates of 96%, 97%, and 94% against the level-8, level-9, and level-10 AIs, respectively.
arXiv Detail & Related papers (2022-09-23T12:24:21Z)
CommonsenseQA 2.0: Exposing the Limits of AI through Gamification [126.85096257968414]
We construct benchmarks that test the abilities of modern natural language understanding models. In this work, we propose gamification as a framework for data construction.
arXiv Detail & Related papers (2022-01-14T06:49:15Z)
Discovering Multi-Agent Auto-Curricula in Two-Player Zero-Sum Games [31.97631243571394]
We introduce a framework, LMAC, that automates the discovery of the update rule without explicit human design. Surprisingly, even without human design, the discovered MARL algorithms achieve competitive or even better performance. We show that LMAC is able to generalise from small games to large games, for example training on Kuhn Poker and outperforming PSRO.
arXiv Detail & Related papers (2021-06-04T22:30:25Z)
L2E: Learning to Exploit Your Opponent [66.66334543946672]
We propose a novel Learning to Exploit framework for implicit opponent modeling. L2E acquires the ability to exploit opponents by a few interactions with different opponents during training. We propose a novel opponent strategy generation algorithm that produces effective opponents for training automatically.
arXiv Detail & Related papers (2021-02-18T14:27:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.