Measuring General Intelligence with Generated Games
- URL: http://arxiv.org/abs/2505.07215v1
- Date: Mon, 12 May 2025 04:01:03 GMT
- Title: Measuring General Intelligence with Generated Games
- Authors: Vivek Verma, David Huang, William Chen, Dan Klein, Nicholas Tomlin
- Abstract summary: gg-bench is a collection of game environments designed to evaluate general reasoning capabilities in language models. gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games.
- Score: 35.118590734217264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data generating process where new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their winrate against these RL agents by prompting models with the game description, current board state, and a list of valid moves, after which models output the moves they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve winrates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini and DeepSeek-R1 achieve average winrates of 31-36%. We release the generated games, data generation process, and evaluation code in order to support future modeling work and expansion of our benchmark.
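As a concrete illustration of this evaluation protocol, the sketch below plays one generated game with an LLM choosing among valid moves against a frozen self-play RL opponent. The environment API (`valid_moves`, the `winner` info key) and the `query_llm` helper are hypothetical stand-ins for illustration, not the released gg-bench code.

```python
# Sketch of the gg-bench evaluation loop: an LLM plays a generated game
# against a frozen self-play RL agent. Environment and helpers are
# hypothetical stand-ins for the released gg-bench code.
import random

def query_llm(prompt: str, valid_moves: list[int]) -> int:
    """Stand-in for an LLM call; a real harness would parse the model's
    chosen move out of its text response."""
    return random.choice(valid_moves)  # placeholder policy

def play_episode(env, rl_agent, game_description: str) -> bool:
    """Return True if the LLM beats the RL opponent in one episode."""
    obs, info = env.reset()
    done = False
    llm_turn = True
    while not done:
        valid_moves = env.valid_moves()  # assumed environment API
        if llm_turn:
            # Prompt contains the game description, board state, and
            # valid moves, as described in the abstract.
            prompt = (
                f"{game_description}\n\nBoard state:\n{env.render()}\n"
                f"Valid moves: {valid_moves}\nWhich move do you take?"
            )
            action = query_llm(prompt, valid_moves)
        else:
            action = rl_agent.act(obs, valid_moves)
        obs, reward, done, truncated, info = env.step(action)
        llm_turn = not llm_turn
    return info.get("winner") == "llm"  # assumed info key

def winrate(env, rl_agent, desc: str, n_episodes: int = 100) -> float:
    wins = sum(play_episode(env, rl_agent, desc) for _ in range(n_episodes))
    return wins / n_episodes
```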
Related papers
- Gameplay Highlights Generation [3.019500891118183]
This work enables gamers to share their gaming experience on social media by automatically generating eye-catching highlight reels from their gameplay sessions. We develop an in-house gameplay event detection dataset containing interesting events annotated by humans using the VIA video annotator. We fine-tuned X-CLIP, a general-purpose multimodal video understanding model, on this dataset; it generalizes across multiple games within a genre without per-game engineering.
arXiv Detail & Related papers (2025-05-12T16:28:22Z)
- Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs). We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on a scene graph dataset. We design a set of graph-centric rewards, including three recall-based variants: Hard Recall, Hard Recall+Relax, and Soft Recall (see the sketch after this entry).
arXiv Detail & Related papers (2025-04-18T10:46:22Z)
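The recall-based rewards above lend themselves to a short sketch. Treating a scene graph as a set of (subject, predicate, object) triplets, Hard Recall counts exact matches, while Soft Recall gives partial credit; the string-similarity relaxation below is an assumption for illustration, not necessarily R1-SGG's exact matching rule.

```python
# Hypothetical sketch of recall-style rewards over scene-graph triplets.
from difflib import SequenceMatcher

Triplet = tuple[str, str, str]  # (subject, predicate, object)

def hard_recall(pred: list[Triplet], gold: list[Triplet]) -> float:
    """Fraction of gold triplets reproduced exactly."""
    if not gold:
        return 0.0
    return len(set(pred) & set(gold)) / len(gold)

def soft_recall(pred: list[Triplet], gold: list[Triplet]) -> float:
    """Partial credit: best string similarity of each gold triplet
    against any predicted triplet (one plausible relaxation)."""
    if not gold:
        return 0.0
    def sim(a: Triplet, b: Triplet) -> float:
        return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()
    return sum(max(sim(g, p) for p in pred) if pred else 0.0
               for g in gold) / len(gold)
```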
- Grammar-based Game Description Generation using Large Language Models [12.329521804287259]
Game Description Language (GDL) provides a standardized way to express diverse games in a machine-readable format. This paper presents a novel framework that leverages Large Language Models (LLMs) to generate grammatically accurate game descriptions from natural language.
arXiv Detail & Related papers (2024-07-24T16:36:02Z)
- The Consensus Game: Language Model Generation via Equilibrium Search [73.51411916625032]
We introduce a new training-free, game-theoretic procedure for language model decoding.
Our approach casts language model decoding as a regularized imperfect-information sequential signaling game.
LLaMA-7B with EQUILIBRIUM-RANKING outperforms the much larger LLaMA-65B and PaLM-540B models.
arXiv Detail & Related papers (2023-10-13T14:27:21Z)
- Generate to Understand for Representation [3.5325087487696463]
GUR is a pretraining framework that combines language modeling and contrastive learning objectives in a single training step.
GUR achieves impressive results without any labeled training data, outperforming all other pretrained baselines as a zero-shot retriever on the recall benchmark (see the sketch after this entry).
arXiv Detail & Related papers (2023-06-14T06:00:18Z)
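A minimal sketch of a single training step that combines a language-modeling loss with an InfoNCE-style contrastive term, in the spirit of GUR; the actual GUR objective, encoder, and loss weighting may differ.

```python
# Sketch: joint language-modeling + contrastive objective in one step.
import torch
import torch.nn.functional as F

def gur_style_loss(lm_logits, target_ids, anchor_emb, positive_emb,
                   temperature=0.05, alpha=1.0):
    """lm_logits: (batch, seq, vocab); target_ids: (batch, seq);
    anchor_emb / positive_emb: (batch, dim) paired text embeddings."""
    # Standard next-token cross-entropy for the LM objective.
    lm_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)), target_ids.reshape(-1))
    # InfoNCE: each anchor should match its own positive against
    # in-batch negatives.
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    contrastive_loss = F.cross_entropy(logits, labels)
    return lm_loss + alpha * contrastive_loss
```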
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis. For data distributions, we explore the effect of mixture distributions and multi-epoch training over programming and natural languages on model performance (see the sketch of a prefix-LM attention mask after this entry).
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
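The prefix-LM unification is easy to picture as an attention mask: tokens in the prefix attend to each other bidirectionally, while the rest of the sequence stays causal. A minimal sketch of such a mask (not CodeGen2's actual implementation):

```python
# Sketch: prefix-LM attention mask. True = attention allowed.
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Bidirectional attention within the prefix, causal elsewhere."""
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Let every prefix position see the whole prefix.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: 6 tokens with a 3-token prefix. Rows are queries, columns keys.
print(prefix_lm_mask(6, 3).int())
```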
- Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models [68.85478477006178]
We present a Promptable Game Model (PGM) for neural video game simulators.
It allows a user to play the game by prompting it with high- and low-level action sequences.
Most notably, our PGM unlocks the director's mode, where the game is played by specifying goals for the agents in the form of a prompt.
Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art.
arXiv Detail & Related papers (2023-03-23T17:43:17Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6% (see the sketch after this entry).
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
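One way to picture this framework: goal and observation features are projected into the embedding space of a pre-trained LM, which serves as the policy backbone. The sketch below uses a GPT-2 backbone from Hugging Face transformers; the projection layer and action head are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch: policy initialized from a pre-trained LM, consuming goal and
# observation embeddings and predicting an action. The projection and
# action head are hypothetical stand-ins.
import torch
import torch.nn as nn
from transformers import GPT2Model

class LMPolicy(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        hidden = self.backbone.config.n_embd  # 768 for gpt2
        self.embed = nn.Linear(obs_dim, hidden)  # map features to LM space
        self.action_head = nn.Linear(hidden, n_actions)

    def forward(self, goal_feats: torch.Tensor, obs_feats: torch.Tensor):
        # goal_feats: (batch, n_goal, obs_dim); obs_feats: (batch, n_obs, obs_dim)
        seq = torch.cat([self.embed(goal_feats), self.embed(obs_feats)], dim=1)
        out = self.backbone(inputs_embeds=seq).last_hidden_state
        return self.action_head(out[:, -1])  # action logits from final position
```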
- Pre-trained Language Models as Prior Knowledge for Playing Text-based Games [2.423547527175808]
In this paper, we improve the agent's semantic understanding by proposing a simple RL-with-LM framework. We perform a detailed study of our framework to demonstrate how our model outperforms all existing agents on the popular game Zork1. Our proposed approach also performs comparably to the state-of-the-art models on a separate set of text-based games.
arXiv Detail & Related papers (2021-07-18T10:28:48Z)
- Keep CALM and Explore: Language Models for Action Generation in Text-based Games [27.00685301984832]
We propose the Contextual Action Language Model (CALM) to generate a compact set of action candidates at each game state.
We combine CALM with a reinforcement learning agent that re-ranks the generated action candidates to maximize in-game rewards (see the sketch after this entry).
arXiv Detail & Related papers (2020-10-06T17:36:29Z)
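The CALM loop above follows a generate-then-re-rank pattern that can be sketched compactly: the LM proposes a small candidate set for the current observation, and the RL agent's value estimates choose among them. Both model APIs below are hypothetical stand-ins, not CALM's actual interface.

```python
# Sketch of a CALM-style loop: an LM proposes action candidates, an RL
# agent re-ranks them by estimated value. Both models are stand-ins.
def propose_actions(lm, observation: str, k: int = 30) -> list[str]:
    """Assumed LM API: sample k candidate action strings for the state."""
    return lm.sample_actions(observation, num_samples=k)

def act(lm, q_network, observation: str) -> str:
    """Pick the candidate the RL value function scores highest."""
    candidates = propose_actions(lm, observation)
    scores = [q_network.value(observation, a) for a in candidates]
    return max(zip(scores, candidates))[1]
```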