Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play
- URL: http://arxiv.org/abs/2508.03368v1
- Date: Tue, 05 Aug 2025 12:15:59 GMT
- Title: Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play
- Authors: Lucia Cipolina-Kun, Marianna Nezhurina, Jenia Jitsev
- Abstract summary: The Board Game Arena library provides a framework for evaluating the decision-making abilities of large language models (LLMs) through strategic board games implemented in Google's OpenSpiel library. It integrates API access to models via LiteLLM and local model deployment via vLLM, and offers distributed execution through Ray. This paper summarizes the structure, key characteristics, and motivation of the repository, highlighting how it contributes to the empirical evaluation of LLM reasoning and game-theoretic behavior.
- Score: 12.20709692079716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Board Game Arena library provides a framework for evaluating the decision-making abilities of large language models (LLMs) through strategic board games implemented in Google's OpenSpiel library. The framework enables systematic comparisons between LLM-based agents and other agents (random, human, reinforcement learning agents, etc.) in various game scenarios by wrapping multiple board and matrix games and supporting different agent types. It integrates API access to models via LiteLLM and local model deployment via vLLM, and offers distributed execution through Ray. Additionally, it provides extensive analysis tools for LLM reasoning traces. This paper summarizes the structure, key characteristics, and motivation of the repository, highlighting how it contributes to the empirical evaluation of LLM reasoning and game-theoretic behavior.
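As a rough illustration of how such an evaluation loop can be wired together, here is a minimal sketch pairing OpenSpiel's Python bindings (pyspiel) with LiteLLM. The prompt format, model name, and random fallback are illustrative assumptions, not the library's actual interface.

```python
# Minimal sketch: an LLM plays an OpenSpiel game via LiteLLM.
# Assumptions: prompt format, model name, and the random fallback are
# illustrative; the actual Board Game Arena API may differ.
import random

import pyspiel                  # Google OpenSpiel
from litellm import completion  # unified API access to hosted models

def llm_choose_action(state, model="gpt-4o-mini"):
    """Ask the model for one of the legal actions; fall back to random."""
    legal = state.legal_actions()
    prompt = (
        f"You are playing {state.get_game().get_type().short_name}.\n"
        f"Board:\n{state}\n"
        f"Legal action ids: {legal}\n"
        "Reply with a single action id."
    )
    reply = completion(model=model,
                       messages=[{"role": "user", "content": prompt}])
    text = reply.choices[0].message.content.strip()
    try:
        action = int(text.split()[0])
    except ValueError:
        action = random.choice(legal)
    return action if action in legal else random.choice(legal)

game = pyspiel.load_game("tic_tac_toe")
state = game.new_initial_state()
while not state.is_terminal():
    state.apply_action(llm_choose_action(state))
print("Returns:", state.returns())
```

The same loop generalizes to other agent types: a random agent, a human, or an RL policy only needs to implement the same choose-action signature.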
Related papers
- Who is a Better Player: LLM against LLM [53.46608216197315]
We propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board game competitions. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players.
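To make "adversarial benchmarking" concrete, the sketch below scores a pool of players with Elo-style ratings over pairwise matches. Elo and the placeholder play() function are stand-ins; the platform's actual rating scheme is not specified in the summary above.

```python
# Sketch: pairwise adversarial benchmarking with Elo-style ratings.
# Elo is an illustrative stand-in for whatever rating scheme the
# platform actually uses; play() is a placeholder for a full game.
from itertools import combinations

K = 32  # Elo update step size

def expected(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a, r_b, score_a):
    """score_a: 1 win, 0.5 draw, 0 loss for player A."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1 - score_a) - (1 - e_a))

ratings = {name: 1000.0 for name in ["model_a", "model_b", "model_c"]}

def play(a, b):            # placeholder: run one game between two models
    return 0.5             # pretend every game is a draw

for a, b in combinations(ratings, 2):
    ratings[a], ratings[b] = update(ratings[a], ratings[b], play(a, b))
print(ratings)
```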
arXiv Detail & Related papers (2025-08-05T06:41:47Z) - Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games [16.187737674778234]
We present Orak, a benchmark designed to train and evaluate Large Language Model (LLM) agents across diverse real-world video games. To support consistent evaluation of LLMs, we introduce a plug-and-play interface based on the Model Context Protocol (MCP). Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects.
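A plug-and-play MCP interface can be approximated with the official mcp Python SDK, as in the sketch below. The tool names and game-state shape are invented for illustration and are not Orak's actual interface.

```python
# Sketch of a plug-and-play game interface in the spirit of MCP.
# The server name, tool names, and state shape are assumptions for
# illustration only; uses the official `mcp` Python SDK (FastMCP).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("toy-game-server")
state = {"turn": 0, "score": 0}

@mcp.tool()
def observe() -> dict:
    """Return the current game state visible to the agent."""
    return state

@mcp.tool()
def act(action: str) -> dict:
    """Apply one action and return the updated state."""
    state["turn"] += 1
    state["score"] += 1 if action == "good_move" else 0
    return state

if __name__ == "__main__":
    mcp.run()   # serve over stdio so any MCP-capable LLM client can play
```

The appeal of the protocol is that swapping in a different game only means exposing the same observe/act tools over a new environment; the LLM client side stays unchanged.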
arXiv Detail & Related papers (2025-06-04T06:40:33Z) - KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation [78.96590724864606]
We introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios.
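The multi-turn, Gymnasium-style interaction such a platform implies boils down to a reset/step loop. In this sketch the environment id and the stubbed LLM policy are assumptions, not KORGym code.

```python
# Sketch: multi-turn evaluation with a Gymnasium-style loop.
# The environment id and the LLM policy stub are illustrative
# assumptions, not part of KORGym itself.
import gymnasium as gym

env = gym.make("FrozenLake-v1")        # any text-representable env works

def llm_policy(observation):
    """Placeholder for an LLM call that maps an observation to an action."""
    return env.action_space.sample()   # stand-in: act randomly

obs, info = env.reset(seed=0)
done = False
total_reward = 0.0
while not done:
    obs, reward, terminated, truncated, info = env.step(llm_policy(obs))
    total_reward += reward
    done = terminated or truncated
print("episode return:", total_reward)
```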
arXiv Detail & Related papers (2025-05-20T16:06:32Z) - WiS Platform: Enhancing Evaluation of LLM-Based Multi-Agent Systems Through Game-Based Analysis [34.639887462203]
We introduce an open, scalable, and real-time updated platform for accessing and analyzing LLM-based multi-agent systems (MAS) through the game "Who is Spy?" (WiS). Our platform features three main strengths: (1) a unified model evaluation interface that supports models available on Hugging Face; (2) a real-time updated leaderboard for model evaluation; and (3) a comprehensive evaluation covering game-winning rates, attacking and defense strategies, and the reasoning of LLMs.
arXiv Detail & Related papers (2024-12-04T14:45:09Z) - Mastering Board Games by External and Internal Planning with Language Models [30.782334791241556]
We show that search-based planning can yield significant improvements in Large Language Models' game-playing strength. We introduce, compare, and contrast two major approaches: in external search, the model guides Monte Carlo Tree Search rollouts and evaluations without calls to an external game engine, and in internal search, the model is trained to generate in-context a linearized search tree and a resulting final choice. Our proposed approach, combining search with domain knowledge, is not specific to board games, hinting at more general future applications.
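The external-search idea can be sketched as MCTS in which the model supplies move priors and leaf value estimates. Below, the model call is stubbed with uniform priors and a zero value, and the PUCT constant and tic-tac-toe test game are arbitrary choices, not the paper's setup.

```python
# Sketch of "external search": MCTS where a language model would supply
# move priors and leaf values (stubbed with uniform priors / zero value).
import math
import pyspiel

C_PUCT = 1.4

class Node:
    def __init__(self, state, prior):
        self.state, self.prior = state, prior
        self.children, self.visits, self.value_sum = {}, 0, 0.0

def model_priors_and_value(state):
    """Stand-in for an LLM call returning (move priors, player-0 value)."""
    legal = state.legal_actions()
    return {a: 1.0 / len(legal) for a in legal}, 0.0

def puct(parent, child):
    q = child.value_sum / child.visits if child.visits else 0.0
    if parent.state.current_player() == 1:  # player 1 minimizes player-0 value
        q = -q
    return q + C_PUCT * child.prior * math.sqrt(parent.visits) / (1 + child.visits)

def search(root, n_sims=200):
    for _ in range(n_sims):
        node, path = root, [root]
        while node.children and not node.state.is_terminal():
            node = max(node.children.values(), key=lambda c: puct(node, c))
            path.append(node)
        if node.state.is_terminal():
            value = node.state.returns()[0]          # player-0 return
        else:                                        # expand with model output
            priors, value = model_priors_and_value(node.state)
            for action, p in priors.items():
                child_state = node.state.clone()
                child_state.apply_action(action)
                node.children[action] = Node(child_state, p)
        for n in path:                               # back up player-0 value
            n.visits += 1
            n.value_sum += value
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

game = pyspiel.load_game("tic_tac_toe")
print("chosen opening action:", search(Node(game.new_initial_state(), 1.0)))
```

Internal search, by contrast, would replace this whole loop with a single model call that emits the linearized search tree and final choice as text.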
arXiv Detail & Related papers (2024-12-02T18:56:51Z) - Evaluating Creativity and Deception in Large Language Models: A Simulation Framework for Multi-Agent Balderdash [6.65572931991284]
Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments.
This paper introduces a simulation framework utilizing the game Balderdash to evaluate both the creativity and logical reasoning of LLMs.
arXiv Detail & Related papers (2024-11-15T18:42:48Z) - LLMBox: A Comprehensive Library for Large Language Models [109.15654830320553]
This paper presents a comprehensive and unified library, LLMBox, to ease the development, use, and evaluation of large language models (LLMs).
This library offers three main merits: (1) a unified data interface that supports the flexible implementation of various training strategies, (2) a comprehensive evaluation that covers extensive tasks, datasets, and models, and (3) more practical considerations, especially regarding user-friendliness and efficiency.
arXiv Detail & Related papers (2024-07-08T02:39:33Z) - GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
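For a flavor of such game-theoretic evaluations, here is a sketch scoring two fixed strategies on an iterated prisoner's dilemma, a classic normal-form task. The payoff matrix is the standard textbook one, and the stub strategies stand in for LLM agents.

```python
# Sketch: scoring two agents on an iterated prisoner's dilemma.
# The payoff matrix is the textbook one; the fixed strategies are
# placeholders where LLM agents would be plugged in.
import numpy as np

# PAYOFF[a_row, a_col] -> (row payoff, col payoff); 0 = cooperate, 1 = defect
PAYOFF = np.array([[(3, 3), (0, 5)],
                   [(5, 0), (1, 1)]])

def tit_for_tat(history):
    return history[-1][1] if history else 0   # copy opponent's last move

def always_defect(history):
    return 1

history, scores = [], np.zeros(2)
for _ in range(10):
    a = tit_for_tat(history)
    b = always_defect([(o, m) for m, o in history])  # opponent's view
    scores += PAYOFF[a, b]
    history.append((a, b))
print("totals (row, col):", scores)
```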
arXiv Detail & Related papers (2024-02-19T18:23:36Z) - Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models [105.39236338147715]
The paper is inspired by the popular language game "Who is Spy".
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z) - Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents [20.202525145391093]
Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents".
This paper explores: Can Large Language Models be evaluated meaningfully by exposing them to constrained game-like settings?
As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions.
arXiv Detail & Related papers (2023-05-22T19:56:10Z) - Deep Reinforcement Learning with Stacked Hierarchical Attention for Text-based Games [64.11746320061965]
We study reinforcement learning for text-based games, which are interactive simulations in the context of natural language.
We aim to conduct explicit reasoning with knowledge graphs for decision making, so that the actions of an agent are generated and supported by an interpretable inference procedure.
We extensively evaluate our method on a number of man-made benchmark games, and the experimental results demonstrate that our method performs better than existing text-based agents.
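As a toy illustration of knowledge-graph-backed decision making (not the paper's stacked hierarchical attention architecture), the sketch below records (subject, relation, object) triples from observations and prefers actions that mention a known entity.

```python
# Sketch: maintaining a knowledge graph of (subject, relation, object)
# triples from text-game observations and using it to justify actions.
# The extraction rule and action filter are toy assumptions.
kg = set()

def update_kg(observation: str):
    """Toy extractor: 'you see a key on the table' -> one triple."""
    if "key" in observation and "table" in observation:
        kg.add(("key", "on", "table"))

def choose_action(candidates):
    """Prefer actions that mention an entity recorded in the graph."""
    for act in candidates:
        if any(ent in act for (ent, _, _) in kg):
            return act
    return candidates[0]

update_kg("You see a key on the table.")
print(choose_action(["take key", "go north"]))   # -> "take key"
```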
arXiv Detail & Related papers (2020-10-22T12:40:22Z)