ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2509.24239v2
- Date: Thu, 06 Nov 2025 13:36:03 GMT
- Title: ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
- Authors: Jincheng Liu, Sijun He, Jingjing Wu, Xiangsen Wang, Yang Chen, Zhaoqi Kuang, Siqi Bao, Yuan Yao,
- Abstract summary: This paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of large language models (LLMs). Chess requires complex strategic reasoning capabilities, including long-term planning, strict rule comprehension, and multi-turn conversation memorization. We show that no model can beat Maia-1100 (a chess engine at human amateur level), while some even fail to defeat a random player that selects moves arbitrarily. We also present a strong baseline for the testbed: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
- Score: 11.234477661864736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills, particularly complex strategic reasoning, or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of LLMs. Chess requires complex strategic reasoning capabilities, including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework where LLMs play against each other under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard, and it can also evaluate fine-grained capabilities, including basic understanding, move selection, and puzzle solving. Over 13 LLMs under different play modes are evaluated in ChessArena, playing over 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), and some even fail to defeat a random player that selects moves arbitrarily. We also present a strong baseline for the testbed: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
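The abstract describes a head-to-head setup: models produce moves, games are adjudicated, and outcomes feed a ranking algorithm. As a rough illustration of the game-loop side of such a testbed, the sketch below pits a move-selection policy against the random baseline mentioned above, using the python-chess library. The policy interface, the function name `llm_pick_move` placeholder, and the forfeit-on-illegal-move rule are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' harness): a game loop between two move
# policies, one of which could wrap an LLM. Requires the python-chess package.
import random
import chess

def random_pick_move(board: chess.Board) -> chess.Move:
    """Baseline that selects uniformly among legal moves."""
    return random.choice(list(board.legal_moves))

def play_game(white_policy, black_policy, max_plies: int = 200) -> str:
    """Play one game; a policy returning an illegal move forfeits (an assumed rule)."""
    board = chess.Board()
    for _ in range(max_plies):
        if board.is_game_over():
            break
        policy = white_policy if board.turn == chess.WHITE else black_policy
        move = policy(board)
        if move not in board.legal_moves:
            # Forfeit by the side to move.
            return "0-1" if board.turn == chess.WHITE else "1-0"
        board.push(move)
    # Returns "1-0", "0-1", "1/2-1/2", or "*" if the ply cap was reached.
    return board.result(claim_draw=True)

if __name__ == "__main__":
    # Here both sides are random; in ChessArena one side would be an LLM wrapper.
    print(play_game(random_pick_move, random_pick_move))
```

In a real harness the LLM policy would translate the board state into a prompt and parse the model's reply back into a move; those details are not specified in this listing.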
Related papers
- LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess [30.797553771114746]
We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs).
We rank over 50 open and closed source models by playing against a random opponent, using a range of behavioral metrics including move quality, move legality, hallucinated actions, and game duration.
For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows models to be compared in an easily understandable way (a minimal Elo-update sketch follows the related-papers list below).
arXiv Detail & Related papers (2025-12-01T18:51:08Z)
- ChessQA: Evaluating Large Language Models for Chess Understanding [10.480398008794436]
Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs).
We present ChessQA, a benchmark that assesses LLM chess understanding across five task categories.
We find persistent weaknesses across all five categories and provide results and error analyses by category.
arXiv Detail & Related papers (2025-10-28T00:02:52Z)
- Out-of-distribution Tests Reveal Compositionality in Chess Transformers [6.356179251855671]
We train a 270M parameter chess Transformer and test it on out-of-distribution scenarios designed to reveal failures of systematic generalization.
Our analysis shows that Transformers exhibit compositional generalization, as evidenced by strong rule extrapolation.
In a more challenging test, we evaluate the models on variants including Chess960, a chess variant in which the starting positions of the pieces are randomized.
arXiv Detail & Related papers (2025-10-23T17:51:28Z)
- Evaluating Language Models' Evaluations of Games [65.49017696754825]
We advocate for a new paradigm that assesses AI systems' evaluations of games.
We leverage a large-scale dataset of over 100 novel board games and over 450 human judgments.
Our results show that reasoning models are generally more aligned with people in their evaluations of games than non-reasoning language models.
arXiv Detail & Related papers (2025-10-13T02:45:37Z)
- Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess [54.5355907369231]
We investigate whether large language models (LLMs) can develop strategic reasoning capabilities through reinforcement learning (RL) in chess.
Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards.
We provide SFT and RL ablations on chess reasoning training and find evidence that the models' remaining limitations stem from a deficit in the pretrained models' internal understanding of chess.
arXiv Detail & Related papers (2025-07-01T13:16:34Z)
- Explore the Reasoning Capability of LLMs in the Chess Testbed [45.12891789312405]
We propose improving the reasoning capability of large language models in chess by integrating annotated strategies and tactics.
We finetune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models on the task of selecting better chess moves.
arXiv Detail & Related papers (2024-11-11T01:42:56Z)
- Predicting Chess Puzzle Difficulty with Transformers [0.0]
We present GlickFormer, a novel transformer-based architecture that predicts chess puzzle difficulty by approximating the Glicko-2 rating system.
The proposed model utilizes a modified ChessFormer backbone for spatial feature extraction and incorporates temporal information via factorized transformer techniques.
Results demonstrate GlickFormer's superior performance compared to the state-of-the-art ChessFormer baseline across multiple metrics.
arXiv Detail & Related papers (2024-10-14T20:39:02Z)
- TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs [45.12542636218608]
We propose TMGBench, characterized by comprehensive game type coverage, diverse scenarios, and flexible game organization.
Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games in our benchmark.
To provide a sustainable evaluation framework adaptable to increasingly powerful LLMs, we treat the aforementioned games as atomic units.
arXiv Detail & Related papers (2024-10-14T13:15:34Z)
- LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models [87.49676980090555]
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities.
We introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs.
arXiv Detail & Related papers (2024-08-28T13:16:41Z)
- GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
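As referenced in the LLM CHESS entry above, ratings for models can be anchored by playing against an engine with configured skill. The sketch below shows only the standard Elo expected-score and update formulas; the K-factor, anchor ratings, and how games are aggregated are assumptions, and the paper's exact estimation procedure is not described in this listing.

```python
# Minimal sketch of the standard Elo update (not the authors' exact procedure).
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after a game with score_a in {0, 0.5, 1}."""
    return rating_a + k * (score_a - elo_expected(rating_a, rating_b))

if __name__ == "__main__":
    # Example: a model rated 1000 draws against an engine level anchored at 1100.
    print(round(elo_update(1000.0, 1100.0, 0.5), 1))  # rating rises slightly
```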