Code World Models for General Game Playing
- URL: http://arxiv.org/abs/2510.04542v1
- Date: Mon, 06 Oct 2025 07:16:07 GMT
- Title: Code World Models for General Game Playing
- Authors: Wolfgang Lehrach, Daniel Hennes, Miguel Lazaro-Gredilla, Xinghua Lou, Carter Wendelken, Zun Li, Antoine Dedieu, Jordi Grau-Moya, Marc Lanctot, Atil Iscen, John Schultz, Marcus Chiam, Ian Gemp, Piotr Zielinski, Satinder Singh, Kevin P. Murphy
- Abstract summary: We use an LLM to translate natural language rules and game trajectories into a formal, executable world model represented as Python code. This generated model serves as a verifiable simulation engine for high-performance planning algorithms. We find that our method outperforms or matches Gemini 2.5 Pro in 9 out of the 10 considered games.
- Score: 22.382021070682256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The reasoning abilities of Large Language Models (LLMs) are increasingly being applied to classical board and card games, but the dominant approach -- prompting for direct move generation -- has significant drawbacks. It relies on the model's implicit, fragile pattern-matching capabilities, leading to frequent illegal moves and strategically shallow play. Here we introduce an alternative approach: we use the LLM to translate natural language rules and game trajectories into a formal, executable world model represented as Python code. This generated model -- comprising functions for state transition, legal move enumeration, and termination checks -- serves as a verifiable simulation engine for high-performance planning algorithms like Monte Carlo tree search (MCTS). In addition, we prompt the LLM to generate heuristic value functions (to make MCTS more efficient) and inference functions (to estimate hidden states in imperfect information games). Our method offers three distinct advantages compared to directly using the LLM as a policy: (1) Verifiability: The generated CWM serves as a formal specification of the game's rules, allowing planners to algorithmically enumerate valid actions and avoid illegal moves, contingent on the correctness of the synthesized model; (2) Strategic Depth: We combine LLM semantic understanding with the deep search power of classical planners; and (3) Generalization: We direct the LLM to focus on the meta-task of data-to-code translation, enabling it to adapt to new games more easily. We evaluate our agent on 10 different games, of which 4 are novel and created for this paper. 5 of the games are fully observed (perfect information), and 5 are partially observed (imperfect information). We find that our method outperforms or matches Gemini 2.5 Pro in 9 out of the 10 considered games.
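To make the abstract's interface concrete, here is a minimal sketch of what an LLM-generated code world model might look like for a trivial game (Nim with 5 stones, take 1 or 2, last stone wins), paired with a flat Monte Carlo planner as a stand-in for full MCTS. The function names (`legal_moves`, `apply_move`, `is_terminal`) are illustrative assumptions, not the paper's actual API.

```python
import random

# Hypothetical CWM for a tiny Nim variant: the three generated functions
# the paper describes -- legal move enumeration, state transition, and
# termination check. A state is (stones_remaining, player_to_move).

def legal_moves(state):
    stones, _player = state
    return [m for m in (1, 2) if m <= stones]

def apply_move(state, move):
    stones, player = state
    return (stones - move, 1 - player)

def is_terminal(state):
    return state[0] == 0

def reward(state, player):
    # The player who takes the last stone wins; at a terminal state it is
    # the loser's turn to move, so the player NOT to move is the winner.
    return 1.0 if state[1] != player else 0.0

def flat_monte_carlo(state, player, n_rollouts=200, seed=0):
    """Pick the legal move with the best average random-rollout value
    (a simplified stand-in for the MCTS planner the paper uses)."""
    rng = random.Random(seed)
    best_move, best_value = None, -1.0
    for move in legal_moves(state):
        total = 0.0
        for _ in range(n_rollouts):
            s = apply_move(state, move)
            while not is_terminal(s):
                s = apply_move(s, rng.choice(legal_moves(s)))
            total += reward(s, player)
        if total / n_rollouts > best_value:
            best_move, best_value = move, total / n_rollouts
    return best_move

# From 5 stones, taking 2 leaves the opponent at 3 (a losing position),
# so the planner should prefer move 2.
print(flat_monte_carlo((5, 0), player=0))  # -> 2
```

Because the planner only ever queries `legal_moves`, it cannot emit an illegal move, which illustrates the verifiability advantage: correctness of play reduces to correctness of the synthesized model rather than the LLM's per-move pattern matching.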
Related papers
- Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory [37.51238507036326]
We use the game of Twenty Questions to evaluate the information-seeking ability of Large Language Models (LLMs). We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game.
arXiv Detail & Related papers (2026-02-02T06:33:18Z) - From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models [64.43268969806098]
We investigate Causal Induction: the ability to infer governing laws from observational data. We compare two approaches to VGDL generation: direct code generation from observations, and a two-stage method that first infers a structural causal model (SCM) and then translates it into VGDL. Results show that the SCM-based approach more often produces VGDL descriptions closer to the ground truth than direct generation.
arXiv Detail & Related papers (2026-01-30T08:48:23Z) - SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition [70.56416162106036]
We introduce visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual and motion information related to actions. We use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions.
arXiv Detail & Related papers (2025-11-13T08:45:24Z) - Boardwalk: Towards a Framework for Creating Board Games with LLMs [0.0]
We aim to investigate whether Large Language Models can implement digital versions of board games from rules described in natural language. We task three state-of-the-art LLMs with coding a selection of 12 popular and obscure games in free-form and within Boardwalk. Our approach proves viable, with the best-performing model, Claude 3.7 Sonnet, yielding 55.6% of games without any errors.
arXiv Detail & Related papers (2025-08-22T15:02:07Z) - Baba is LLM: Reasoning in a Game with Dynamic Rules [0.0]
Large language models (LLMs) are known to perform well on language tasks, but struggle with reasoning tasks. This paper explores the ability of LLMs to play the 2D puzzle game Baba is You.
arXiv Detail & Related papers (2025-06-23T20:16:28Z) - Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning [89.93384726755106]
Vision-language reinforcement learning (RL) has primarily focused on narrow domains. We find video games inherently provide rich visual elements and mechanics that are easy to verify. To fully use the multimodal and verifiable reward in video games, we propose Game-RL.
arXiv Detail & Related papers (2025-05-20T03:47:44Z) - Measuring General Intelligence with Generated Games [35.118590734217264]
gg-bench is a collection of game environments designed to evaluate general reasoning capabilities in language models. gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games.
arXiv Detail & Related papers (2025-05-12T04:01:03Z) - GAMEBoT: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z) - How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [83.78240828340681]
GAMA(γ)-Bench is a new framework for evaluating Large Language Models' Gaming Ability in Multi-Agent environments. γ-Bench includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to assess LLMs' performance. Our results indicate GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought.
arXiv Detail & Related papers (2024-03-18T14:04:47Z) - GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment composing 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z) - SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM).
In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment.
Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z) - Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.