GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games
- URL: http://arxiv.org/abs/2508.08501v1
- Date: Mon, 11 Aug 2025 22:17:07 GMT
- Title: GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games
- Authors: Yuchen Li, Cong Lin, Muhammad Umair Nasir, Philip Bontrager, Jialin Liu, Julian Togelius
- Abstract summary: We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model's ability to handle tasks that differ from most existing LLM benchmarks.
- Score: 8.640618631999173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model's ability to handle tasks that differ from most existing LLM benchmarks. The benchmark leverages a game description language that enables rapid creation of new games and levels, helping to prevent overfitting over time. Each game scene is represented by a compact set of ASCII characters, allowing for efficient processing by language models. GVGAI-LLM defines interpretable metrics, including the meaningful step ratio, step efficiency, and overall score, to assess model behavior. Through zero-shot evaluations across a broad set of games and levels with diverse challenges and skill depth, we reveal persistent limitations of LLMs in spatial reasoning and basic planning. Current models consistently exhibit spatial and logical errors, motivating structured prompting and spatial grounding techniques. While these interventions lead to partial improvements, the benchmark remains very far from solved. GVGAI-LLM provides a reproducible testbed for advancing research on language model capabilities, with a particular emphasis on agentic behavior and contextual reasoning.
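The abstract names three interpretable metrics (meaningful step ratio, step efficiency, overall score) computed over ASCII-rendered game scenes, but does not spell out their formulas here. The sketch below is one plausible reading, assuming the meaningful step ratio is the fraction of actions that actually change the scene, step efficiency is score per step, and overall score is the cumulative episode score; the `Step` record and all three formulas are illustrative assumptions, not the paper's definitions.

```python
# Hypothetical sketch of GVGAI-LLM-style metrics computed from an episode trace.
# The metric names come from the abstract; the formulas below are assumptions.
from dataclasses import dataclass


@dataclass
class Step:
    scene_before: str  # ASCII grid shown to the LLM agent
    scene_after: str   # ASCII grid after the chosen action
    reward: float      # per-step score change reported by the game


def meaningful_step_ratio(steps: list[Step]) -> float:
    """Assumed: fraction of actions that actually change the game state."""
    if not steps:
        return 0.0
    meaningful = sum(s.scene_before != s.scene_after for s in steps)
    return meaningful / len(steps)


def step_efficiency(steps: list[Step]) -> float:
    """Assumed: total episode score normalized by the number of steps taken."""
    if not steps:
        return 0.0
    return sum(s.reward for s in steps) / len(steps)


def overall_score(steps: list[Step]) -> float:
    """Assumed: cumulative score over the episode."""
    return sum(s.reward for s in steps)
```

Under these assumptions, an agent that wanders without altering the scene scores near zero on the meaningful step ratio even if it eventually stumbles into reward, which matches the abstract's emphasis on spatial reasoning and basic planning rather than raw score alone.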
Related papers
- From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models [64.43268969806098]
We investigate Causal Induction: the ability to infer governing laws from observational data. We compare two approaches to VGDL generation: direct code generation from observations, and a two-stage method that first infers a structural causal model (SCM) and then translates it into VGDL. Results show that the SCM-based approach more often produces VGDL descriptions closer to the ground truth than direct generation.
arXiv Detail & Related papers (2026-01-30T08:48:23Z) - Game-Time: Evaluating Temporal Dynamics in Spoken Language Models [93.844257719952]
We introduce the Game-Time Benchmark framework to assess temporal capabilities. Our evaluation of diverse SLMs reveals a clear performance disparity. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally-aware conversational AI.
arXiv Detail & Related papers (2025-09-30T15:23:39Z) - V-GameGym: Visual Game Generation for Code Large Language Models [29.687615056084166]
V-GameGym is a comprehensive benchmark comprising 2,219 high-quality samples across 100 thematic clusters. We introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis. Our analysis reveals that V-GameGym effectively bridges the gap between code generation accuracy and practical game development.
arXiv Detail & Related papers (2025-09-24T14:01:18Z) - Play to Generalize: Learning to Reason Through Game Play [11.778612579151067]
We propose a novel post-training paradigm, Visual Game Learning, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pretext tasks.
arXiv Detail & Related papers (2025-06-09T17:59:57Z) - V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. We show V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z) - RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines [34.002194150560086]
We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines. RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS).
arXiv Detail & Related papers (2025-02-01T23:40:24Z) - GAMEBoT: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z) - clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents [19.989503513817095]
Large Language Models can be prompted to "self-play" conversational games that probe certain capabilities.
We take one of the proposed frameworks for setting up such game-play environments, and test its usefulness as an evaluation instrument.
arXiv Detail & Related papers (2024-05-31T14:43:31Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs).
GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms.
We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z) - Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as
Conversational Agents [20.202525145391093]
Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents".
This paper explores: Can Large Language Models be evaluated meaningfully by exposing them to constrained game-like settings?
As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions.
arXiv Detail & Related papers (2023-05-22T19:56:10Z)