Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
- URL: http://arxiv.org/abs/2506.03610v1
- Date: Wed, 04 Jun 2025 06:40:33 GMT
- Title: Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
- Authors: Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, Jaewoong Cho
- Abstract summary: We present Orak, a benchmark designed to train and evaluate Large Language Model (LLM) agents across diverse real-world video games. To support consistent evaluation of LLMs, we introduce a plug-and-play interface based on the Model Context Protocol (MCP). Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects.
- Score: 16.187737674778234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) agents are reshaping the game industry, particularly with more intelligent and human-preferable game characters. However, existing game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets for aligning pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a foundational benchmark designed to train and evaluate LLM agents across diverse real-world video games. Unlike existing benchmarks, Orak includes 12 popular video games spanning all major genres, enabling comprehensive studies of LLM capabilities and agentic modules essential for intricate game scenarios. To support consistent evaluation of LLMs, we introduce a plug-and-play interface based on the Model Context Protocol (MCP) that enables LLMs to seamlessly connect with games and manipulate agentic modules. Additionally, we propose a fine-tuning dataset consisting of LLM gameplay trajectories across diverse game genres. Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects, establishing a foundation towards building generic gaming agents. Code is available at https://github.com/krafton-ai/Orak.
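The abstract does not spell out how the MCP-based plug-and-play interface is implemented, so the following is only a minimal, hypothetical sketch of the idea in plain Python: a fixed agent loop that talks to any game through a small observe/step contract, with swappable agentic modules (e.g., reflection) rewriting the prompt before each LLM call. The GameEnv, AgenticModule, ToyGame, and call_llm names are assumptions introduced for illustration and are not part of the Orak codebase or any MCP SDK.

```python
# Hypothetical sketch of a plug-and-play game/agent interface in the spirit of
# the MCP-based design described above. Names (GameEnv, AgenticModule, ToyGame,
# call_llm) are illustrative assumptions, not the Orak API.
from abc import ABC, abstractmethod


class GameEnv(ABC):
    """Minimal game-side contract: expose state as text, accept text actions."""

    @abstractmethod
    def observe(self) -> str: ...

    @abstractmethod
    def step(self, action: str) -> bool:
        """Apply an action; return True when the episode is over."""


class AgenticModule(ABC):
    """Optional module (e.g., reflection, planning) that rewrites the prompt."""

    @abstractmethod
    def augment(self, prompt: str, history: list[str]) -> str: ...


class Reflection(AgenticModule):
    def augment(self, prompt: str, history: list[str]) -> str:
        past = "\n".join(history[-3:]) or "none"
        return f"{prompt}\nRecent actions:\n{past}\nReflect before acting."


class ToyGame(GameEnv):
    """Tiny stand-in game: raise a counter to 3 to win."""

    def __init__(self) -> None:
        self.counter = 0

    def observe(self) -> str:
        return f"Counter is {self.counter}; say 'increment' to raise it."

    def step(self, action: str) -> bool:
        if "increment" in action.lower():
            self.counter += 1
        return self.counter >= 3


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (chat API, local model, etc.)."""
    return "increment"


def run_episode(env: GameEnv, modules: list[AgenticModule], max_turns: int = 10) -> None:
    history: list[str] = []
    for _ in range(max_turns):
        prompt = env.observe()
        for module in modules:  # agentic modules plug in here
            prompt = module.augment(prompt, history)
        action = call_llm(prompt)
        history.append(action)
        if env.step(action):
            print("Episode finished after", len(history), "turns")
            return
    print("Turn limit reached")


if __name__ == "__main__":
    run_episode(ToyGame(), [Reflection()])
```

Keeping the agent loop independent of any particular game is what would let the same LLM and agentic modules be compared consistently across all 12 titles, which is the consistency property the MCP interface is described as providing.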
Related papers
- Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play [12.20709692079716]
The Board Game Arena library provides a framework for evaluating the decision-making abilities of large language models (LLMs) through strategic board games implemented in the Google OpenSpiel library. It integrates API access to models via LiteLLM and local model deployment via vLLM, and offers distributed execution through Ray. This paper summarizes the structure, key characteristics, and motivation of the repository, highlighting how it contributes to the empirical evaluation of LLM reasoning and game-theoretic behavior.
arXiv Detail & Related papers (2025-08-05T12:15:59Z) - lmgame-Bench: How Good are LLMs at Playing Games? [60.01834131847881]
We study the major challenges in using popular video games to evaluate modern large language model (LLM) agents. We introduce lmgame-Bench to turn games into reliable evaluations.
arXiv Detail & Related papers (2025-05-21T06:02:55Z) - Empowering LLMs in Decision Games through Algorithmic Data Synthesis [29.128280701799074]
Decision-making games serve as ideal sandboxes for evaluating and enhancing the reasoning abilities of Large Language Models. We design data synthesis strategies and curate extensive offline datasets from two classic games, Doudizhu and Go. We develop a suite of techniques to effectively incorporate this data into LLM training, resulting in two novel agents: Mastermind-Dou and Mastermind-Go.
arXiv Detail & Related papers (2025-03-18T07:30:29Z) - GAMEBoT: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z) - LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents [10.632179121247466]
We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire.
Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players.
This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process.
arXiv Detail & Related papers (2024-10-01T18:40:43Z) - Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games [2.2648566044372416]
We introduce a novel benchmark aimed at testing the emergent capabilities of multimodal LLMs as low-level policies in Atari games. Our study assesses the performance of multiple multimodal LLMs against traditional RL agents, human players, and random agents. Our results show that these multimodal LLMs are not yet capable of being zero-shot low-level policies.
arXiv Detail & Related papers (2024-08-28T17:08:56Z) - A Survey on Large Language Model-Based Game Agents [9.140712714337273]
The development of game agents plays a critical role in advancing towards Artificial General Intelligence. The progress of Large Language Models (LLMs) offers an unprecedented opportunity to evolve and empower game agents. This paper provides a comprehensive overview of LLM-based game agents from a holistic viewpoint.
arXiv Detail & Related papers (2024-04-02T15:34:18Z) - GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z) - Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding. The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning. This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z) - SmartPlay: A Benchmark for LLMs as Intelligent Agents [45.76707302899935]
SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi, and Minecraft.
Each game challenges a subset of 9 important capabilities of an intelligent LLM agent.
Tests include reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness.
arXiv Detail & Related papers (2023-10-02T18:52:11Z) - Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation [52.930183136111864]
We propose using scorable negotiation to evaluate Large Language Models (LLMs).
To reach an agreement, agents must have strong arithmetic, inference, exploration, and planning capabilities.
We provide procedures to create new games and increase games' difficulty to have an evolving benchmark.
arXiv Detail & Related papers (2023-09-29T13:33:06Z) - GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs).
GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms.
We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z) - SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM).
In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment.
Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.