ChessQA: Evaluating Large Language Models for Chess Understanding
- URL: http://arxiv.org/abs/2510.23948v1
- Date: Tue, 28 Oct 2025 00:02:52 GMT
- Title: ChessQA: Evaluating Large Language Models for Chess Understanding
- Authors: Qianfeng Wen, Zhenwei Tang, Ashton Anderson
- Abstract summary: Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs). We present ChessQA, a benchmark that assesses LLM chess understanding across five task categories. We find persistent weaknesses across all five categories and provide results and error analyses by category.
- Score: 10.480398008794436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs), as it has well-defined structure and objective ground truth while admitting a wide spectrum of skill levels. However, existing evaluations of LLM ability in chess are ad hoc and narrow in scope, making it difficult to accurately measure LLM chess understanding and how it varies with scale, post-training methodologies, or architecture choices. We present ChessQA, a comprehensive benchmark that assesses LLM chess understanding across five task categories (Structural, Motifs, Short Tactics, Position Judgment, and Semantic), which approximately correspond to the ascending abstractions that players master as they accumulate chess knowledge, from understanding basic rules and learning tactical motifs to correctly calculating tactics, evaluating positions, and semantically describing high-level concepts. In this way, ChessQA captures a more comprehensive picture of chess ability and understanding, going significantly beyond the simple move quality evaluations done previously, and offers a controlled, consistent setting for diagnosis and comparison. Furthermore, ChessQA is inherently dynamic, with prompts, answer keys, and construction scripts that can evolve as models improve. Evaluating a range of contemporary LLMs, we find persistent weaknesses across all five categories and provide results and error analyses by category. We will release the code, periodically refreshed datasets, and a public leaderboard to support further research.
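As a rough illustration of what scoring a single item in such a benchmark could look like, the sketch below uses the python-chess library to derive an objective answer key directly from a position; the item schema and the `query_llm` stub are hypothetical assumptions, not ChessQA's actual format:

```python
# Score one benchmark-style item: a FEN-encoded position, a
# category-tagged question, and an exact-match answer key derived
# directly from the board. Schema and stub are illustrative only.
import chess

def query_llm(prompt: str) -> str:
    """Stand-in for a call to the model under evaluation."""
    raise NotImplementedError

def score_item(item: dict) -> bool:
    board = chess.Board(item["fen"])  # ground truth derives from the position
    prompt = (
        f"Position (FEN): {board.fen()}\n"
        f"Category: {item['category']}\n"
        f"Question: {item['question']}\n"
        "Answer concisely."
    )
    return query_llm(prompt).strip() == item["answer"]

# Example item in the spirit of the Structural category.
item = {
    "fen": chess.STARTING_FEN,
    "category": "Structural",
    "question": "How many legal moves does White have?",
    "answer": str(chess.Board().legal_moves.count()),  # 20
}
```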
Related papers
- LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess [30.797553771114746]
We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs). We rank over 50 open- and closed-source models by playing against a random opponent, using a range of behavioral metrics including move quality, move legality, hallucinated actions, and game duration. For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for comparisons between models in an easily understandable way.
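A minimal sketch of how the move-legality and hallucination metrics above might be computed with the python-chess library; the `classify_move` helper and its three-way taxonomy are illustrative assumptions, not the framework's actual implementation:

```python
# Classify a model's proposed move as legal, illegal (well-formed SAN
# that is not playable in this position), or hallucinated (not
# interpretable as a move at all). Assumes a recent python-chess.
import chess

def classify_move(board: chess.Board, move_text: str) -> str:
    try:
        board.parse_san(move_text)   # raises unless the SAN is legal here
        return "legal"
    except chess.IllegalMoveError:
        return "illegal"
    except ValueError:               # InvalidMoveError, AmbiguousMoveError
        return "hallucinated"

board = chess.Board()
assert classify_move(board, "e4") == "legal"
assert classify_move(board, "Ke2") == "illegal"        # king is blocked
assert classify_move(board, "O-O-O-O") == "hallucinated"
```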
arXiv Detail & Related papers (2025-12-01T18:51:08Z)
- Towards Piece-by-Piece Explanations for Chess Positions with SHAP [0.20305676256390937]
We adapt SHAP (SHapley Additive exPlanations) to attribute a chess engine's evaluation to specific pieces on the board. By treating pieces as features and systematically ablating them, we compute additive, per-piece contributions that explain the engine's output. This method draws inspiration from classical chess pedagogy, where players assess positions by mentally removing pieces.
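The ablation idea lends itself to a compact sketch: remove one piece at a time and record how a (stub) evaluation changes. Note that this leave-one-out scheme is a simplification for illustration; the paper's SHAP formulation averages marginal contributions over piece subsets, and `evaluate` here is plain material count rather than a real engine call:

```python
import chess

def evaluate(board: chess.Board) -> float:
    """Stand-in evaluation: signed material count (White minus Black)."""
    values = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
              chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}
    return sum(values[p.piece_type] * (1 if p.color == chess.WHITE else -1)
               for p in board.piece_map().values())

def per_piece_contributions(board: chess.Board) -> dict:
    """Leave-one-out attribution: base eval minus eval with the piece removed."""
    base = evaluate(board)
    contributions = {}
    for square, piece in board.piece_map().items():
        if piece.piece_type == chess.KING:
            continue                      # kings cannot be ablated
        ablated = board.copy()
        ablated.remove_piece_at(square)
        contributions[chess.square_name(square)] = base - evaluate(ablated)
    return contributions

print(per_piece_contributions(chess.Board()))
```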
arXiv Detail & Related papers (2025-10-26T09:07:21Z)
- ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models [11.234477661864736]
This paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of large language models (LLMs). Chess requires complex strategic reasoning capabilities, including long-term planning, strict rule comprehension, and multi-turn conversation memorization. We show that no model can beat Maia-1100 (a chess engine at human amateur level), while some even fail to defeat a random player that selects moves arbitrarily. We also present a strong baseline for the testbed: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
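For concreteness, the random-player baseline mentioned above is trivial to realize with python-chess; `get_llm_move` below is a hypothetical stub for the model under test:

```python
import random
import chess

def random_move(board: chess.Board) -> chess.Move:
    """Baseline opponent: pick uniformly among the legal moves."""
    return random.choice(list(board.legal_moves))

def get_llm_move(board: chess.Board) -> chess.Move:
    """Stand-in: query the LLM and parse its reply into a move."""
    raise NotImplementedError

def play_game(max_plies: int = 200) -> str:
    board = chess.Board()
    while not board.is_game_over() and board.ply() < max_plies:
        move = get_llm_move(board) if board.turn == chess.WHITE else random_move(board)
        board.push(move)
    return board.result()   # "1-0", "0-1", "1/2-1/2", or "*" if unfinished
```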
arXiv Detail & Related papers (2025-09-29T03:24:48Z)
- Who is a Better Player: LLM against LLM [53.46608216197315]
We propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board-game competition. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players.
arXiv Detail & Related papers (2025-08-05T06:41:47Z)
- Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess [54.5355907369231]
We investigate whether large language models (LLMs) can develop strategic reasoning capabilities through reinforcement learning (RL) in chess. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. We provide SFT and RL ablations on chess reasoning training and find evidence that the remaining limitations stem from a deficit in the pretrained models' internal understanding of chess.
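The sparse-versus-dense distinction can be made concrete with a small sketch. The dense variant below shapes each move's reward by an evaluation delta; treating that evaluator as an engine or teacher model is an illustrative assumption standing in for the paper's distillation-based signal:

```python
import chess

def sparse_reward(board: chess.Board, color: bool) -> float:
    """Binary outcome reward: nonzero only once the game has ended."""
    if not board.is_game_over():
        return 0.0
    result = board.result()
    if result == "1/2-1/2":
        return 0.0
    return 1.0 if (result == "1-0") == (color == chess.WHITE) else -1.0

def dense_reward(board: chess.Board, move: chess.Move, color: bool,
                 evaluate) -> float:
    """Per-move shaped reward: change in `evaluate` caused by the move."""
    before = evaluate(board)
    board.push(move)
    after = evaluate(board)
    board.pop()                       # leave the board unchanged
    delta = after - before
    return delta if color == chess.WHITE else -delta
```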
arXiv Detail & Related papers (2025-07-01T13:16:34Z)
- Explore the Reasoning Capability of LLMs in the Chess Testbed [45.12891789312405]
We propose improving the reasoning capability of large language models in chess by integrating annotated strategies and tactics. We fine-tune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models on the task of selecting better chess moves.
arXiv Detail & Related papers (2024-11-11T01:42:56Z)
- LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models [87.49676980090555]
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities.
We introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs.
arXiv Detail & Related papers (2024-08-28T13:16:41Z)
- Aspect-based Sentiment Evaluation of Chess Moves (ASSESS): an NLP-based Method for Evaluating Chess Strategies from Textbooks [3.652509571098292]
This study investigates the feasibility of using a modified sentiment-analysis method to evaluate chess moves based on text.
By extracting insights from move-action phrases, our approach aims to provide a more fine-grained and contextually aware, 'chess move'-based sentiment classification.
arXiv Detail & Related papers (2024-05-10T14:23:43Z)
- Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models [105.39236338147715]
The paper is inspired by the popular language game "Who is Spy".
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z)
- Learning to Play Chess from Textbooks (LEAP): a Corpus for Evaluating Chess Moves based on Sentiment Analysis [4.314956204483074]
This paper examines chess textbooks as a new knowledge source for enabling machines to learn how to play chess.
We developed the LEAP corpus, a new heterogeneous dataset with structured data (chess move notations and board states) and unstructured data.
We performed empirical experiments that assess the performance of various transformer-based baseline models for sentiment analysis.
arXiv Detail & Related papers (2023-10-31T08:26:02Z)
- Learning Chess Blindfolded: Evaluating Language Models on State Tracking [69.3794549747725]
We consider the task of language modeling for the game of chess.
Unlike natural language, chess notations describe a simple, constrained, and deterministic domain.
We find that transformer language models can learn to track pieces and predict legal moves with high accuracy when trained solely on move sequences.
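That probe is easy to reproduce in outline: replay a move sequence and emit (prefix, board state, legal moves) supervision triples, so a model trained only on moves can be tested on piece tracking and legal-move prediction. The sample game below is arbitrary:

```python
import chess

def state_tracking_pairs(san_moves):
    """Build (move-prefix -> board-state) pairs from one game's moves."""
    board = chess.Board()
    pairs = []
    for san in san_moves:
        board.push_san(san)
        pairs.append({
            "prefix": " ".join(san_moves[: board.ply()]),   # moves so far
            "fen": board.fen(),                             # full state
            "legal": sorted(m.uci() for m in board.legal_moves),
        })
    return pairs

pairs = state_tracking_pairs(["e4", "e5", "Nf3", "Nc6", "Bb5"])
print(pairs[-1]["fen"])   # position after 3.Bb5 (Ruy Lopez)
```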
arXiv Detail & Related papers (2021-02-26T01:16:23Z)