Out-of-distribution Tests Reveal Compositionality in Chess Transformers
- URL: http://arxiv.org/abs/2510.20783v1
- Date: Thu, 23 Oct 2025 17:51:28 GMT
- Title: Out-of-distribution Tests Reveal Compositionality in Chess Transformers
- Authors: Anna Mészáros, Patrik Reizinger, Ferenc Huszár
- Abstract summary: We train a 270M parameter chess Transformer and test it on out-of-distribution scenarios, designed to reveal failures of systematic generalization. Our analysis shows that Transformers exhibit compositional generalization, as evidenced by strong rule extrapolation. In a more challenging test, we evaluate the models on variants including Chess960 - a variant of chess where starting positions of pieces are randomized.
- Score: 6.356179251855671
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chess is a canonical example of a task that requires rigorous reasoning and long-term planning. Modern decision Transformers - trained similarly to LLMs - are able to learn competent gameplay, but it is unclear to what extent they truly capture the rules of chess. To investigate this, we train a 270M parameter chess Transformer and test it on out-of-distribution (OOD) scenarios, designed to reveal failures of systematic generalization. Our analysis shows that Transformers exhibit compositional generalization, as evidenced by strong rule extrapolation: they adhere to fundamental syntactic rules of the game by consistently choosing valid moves even in situations very different from the training data. Moreover, they also generate high-quality moves for OOD puzzles. In a more challenging test, we evaluate the models on variants including Chess960 (Fischer Random Chess) - a variant of chess where starting positions of pieces are randomized. We found that while the models exhibit basic strategy adaptation, they are inferior to symbolic AI algorithms that perform explicit search, but the gap is smaller when playing against users on Lichess. Moreover, the training dynamics revealed that the model initially learns to move only its own pieces, suggesting an emergent compositional understanding of the game.
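The abstract describes Chess960 only as a variant with randomized starting positions. As a minimal sketch of what that randomization involves, the Python below enumerates the back-rank arrangements allowed by the standard Chess960 rules (bishops on opposite-colored squares, king between the two rooks) and samples one. The function and variable names are hypothetical, and this is not the paper's evaluation code.

```python
import random
from itertools import permutations


def is_valid_chess960(rank: str) -> bool:
    """Check the two Chess960 constraints on an 8-character back rank."""
    bishops = [i for i, piece in enumerate(rank) if piece == "B"]
    rooks = [i for i, piece in enumerate(rank) if piece == "R"]
    king = rank.index("K")
    # Squares alternate in color along a rank, so opposite colors
    # means the two bishop file indices differ in parity.
    opposite_colors = bishops[0] % 2 != bishops[1] % 2
    king_between_rooks = rooks[0] < king < rooks[1]
    return opposite_colors and king_between_rooks


# Deduplicate permutations of the standard piece set, then filter.
ALL_BACK_RANKS = sorted(
    rank
    for rank in {"".join(p) for p in permutations("RNBQKBNR")}
    if is_valid_chess960(rank)
)


def random_back_rank(rng: random.Random) -> str:
    """Sample one legal Chess960 back rank uniformly at random."""
    return rng.choice(ALL_BACK_RANKS)


def piece_placement(rank: str) -> str:
    """Piece-placement field of a FEN string; Black mirrors White's files."""
    return f"{rank.lower()}/pppppppp/8/8/8/8/PPPPPPPP/{rank}"


if __name__ == "__main__":
    rank = random_back_rank(random.Random())
    print(len(ALL_BACK_RANKS), rank, piece_placement(rank))
```

The enumeration yields 960 arrangements, which is where the variant's name comes from; the classical starting position `RNBQKBNR` is one of them, so most sampled positions differ from the opening seen in standard-chess training data.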
Related papers
- ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models [11.234477661864736]
This paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of large language models (LLMs). Chess requires complex strategic reasoning capabilities including long-term planning, strict rule comprehension, and multi-turn conversation memorization. We show that no model can beat Maia-1100 (a chess engine at human amateur level), while some even failed to defeat a random player that selects moves arbitrarily. We also present a strong baseline to the testbed: our fine-tuned Qwen3-8B substantially improved performance, approaching much larger state-of-the-art reasoning models.
arXiv Detail & Related papers (2025-09-29T03:24:48Z)
- Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess [54.5355907369231]
We investigate whether large language models (LLMs) can develop strategic reasoning capabilities through reinforcement learning (RL) in chess. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models' internal understanding of chess.
arXiv Detail & Related papers (2025-07-01T13:16:34Z)
- Explore the Reasoning Capability of LLMs in the Chess Testbed [45.12891789312405]
We propose improving the reasoning capability of large language models in chess by integrating annotated strategy and tactics. We finetune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models in the task of selecting better chess moves.
arXiv Detail & Related papers (2024-11-11T01:42:56Z)
- Predicting Chess Puzzle Difficulty with Transformers [0.0]
We present GlickFormer, a novel transformer-based architecture that predicts chess puzzle difficulty by approximating the Glicko-2 rating system. The proposed model utilizes a modified ChessFormer backbone for spatial feature extraction and incorporates temporal information via factorized transformer techniques. Results demonstrate GlickFormer's superior performance compared to the state-of-the-art ChessFormer baseline across multiple metrics.
arXiv Detail & Related papers (2024-10-14T20:39:02Z)
- Amortized Planning with Large-Scale Transformers: A Case Study on Chess [11.227110138932442]
This paper uses chess, a landmark planning problem in AI, to assess performance on a planning task.
ChessBench is a large-scale benchmark of 10 million chess games with legal move and value annotations (15 billion points) provided by Stockfish.
We show that, although a remarkably good approximation can be distilled into large-scale transformers via supervised learning, perfect distillation is still beyond reach.
arXiv Detail & Related papers (2024-02-07T00:36:24Z)
- Learning to Play Chess from Textbooks (LEAP): a Corpus for Evaluating Chess Moves based on Sentiment Analysis [4.314956204483074]
This paper examines chess textbooks as a new knowledge source for enabling machines to learn how to play chess.
We developed the LEAP corpus, a new heterogeneous dataset, the first of its kind, with structured (chess move notations and board states) and unstructured data.
We performed empirical experiments that assess the performance of various transformer-based baseline models for sentiment analysis.
arXiv Detail & Related papers (2023-10-31T08:26:02Z)
- Finding mixed-strategy equilibria of continuous-action games without gradients using randomized policy networks [83.28949556413717]
We study the problem of computing an approximate Nash equilibrium of continuous-action games without access to gradients.
We model players' strategies using artificial neural networks.
This paper is the first to solve general continuous-action games with unrestricted mixed strategies and without any gradient information.
arXiv Detail & Related papers (2022-11-29T05:16:41Z)
- Determining Chess Game State From an Image [19.06796946564999]
This paper puts forth a new dataset synthesised from a 3D model that is an order of magnitude larger than existing ones.
A novel end-to-end chess recognition system is presented that combines traditional computer vision techniques with deep learning.
The described system achieves an error rate of 0.23% per square on the test set, 28 times better than the current state of the art.
arXiv Detail & Related papers (2021-04-30T13:02:13Z)
- Learning Chess Blindfolded: Evaluating Language Models on State Tracking [69.3794549747725]
We consider the task of language modeling for the game of chess.
Unlike natural language, chess notations describe a simple, constrained, and deterministic domain.
We find that transformer language models can learn to track pieces and predict legal moves with high accuracy when trained solely on move sequences.
arXiv Detail & Related papers (2021-02-26T01:16:23Z)
- Learning to Play Sequential Games versus Unknown Opponents [93.8672371143881]
We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action.
We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents.
Our results include regret guarantees for the algorithm that depend on the regularity of the opponent's response.
arXiv Detail & Related papers (2020-07-10T09:33:05Z)
- Smooth markets: A basic mechanism for organizing gradient-based learners [47.34060971879986]
We introduce smooth markets (SM-games), a class of n-player games with pairwise zero sum interactions.
SM-games codify a common design pattern in machine learning that includes (some) GANs, adversarial training, and other recent algorithms.
We show that SM-games are amenable to analysis and optimization using first-order methods.
arXiv Detail & Related papers (2020-01-14T09:19:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.