Explore the Reasoning Capability of LLMs in the Chess Testbed
- URL: http://arxiv.org/abs/2411.06655v2
- Date: Fri, 28 Feb 2025 11:58:28 GMT
- Title: Explore the Reasoning Capability of LLMs in the Chess Testbed
- Authors: Shu Wang, Lei Ji, Renxi Wang, Wenxiao Zhao, Haokun Liu, Yifan Hou, Ying Nian Wu,
- Abstract summary: We propose improving the reasoning capability of large language models in chess by integrating annotated strategies and tactics. We finetune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models on the task of selecting better chess moves.
- Score: 45.12891789312405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning is a central capability of human intelligence. In recent years, with the advent of large-scale datasets, pretrained large language models have emerged with new capabilities, including reasoning. However, these models still struggle with long-term, complex reasoning tasks, such as playing chess. Based on the observation that expert chess players employ a dual approach combining long-term strategic play with short-term tactical play, along with language explanation, we propose improving the reasoning capability of large language models in chess by integrating annotated strategies and tactics. Specifically, we collect a dataset named MATE, which consists of 1 million chess positions with candidate moves annotated by chess experts for strategy and tactics. We finetune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models on the task of selecting better chess moves. Our experiments show that our models perform better than GPT, Claude, and Gemini models. We find that language explanations can enhance the reasoning capability of large language models.
Related papers
- ChessQA: Evaluating Large Language Models for Chess Understanding [10.480398008794436]
Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs). We present ChessQA, a benchmark that assesses LLM chess understanding across five task categories. We find persistent weaknesses across all five categories and provide results and error analyses by category.
arXiv Detail & Related papers (2025-10-28T00:02:52Z) - Evaluating Language Models' Evaluations of Games [65.49017696754825]
We advocate for a new paradigm that assesses AI systems' evaluations of games. We leverage a large-scale dataset of over 100 novel board games and over 450 human judgments. Our results show that reasoning models are generally more aligned with people in their evaluations of games than non-reasoning language models.
arXiv Detail & Related papers (2025-10-13T02:45:37Z) - ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models [11.234477661864736]
This paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of large language models (LLMs). Chess requires complex strategic reasoning capabilities, including long-term planning, strict rule comprehension, and multi-turn conversation memorization. We show that no model can beat Maia-1100 (a chess engine at human amateur level), while some even fail to defeat a random player that selects moves arbitrarily. We also present a strong baseline for the testbed: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
arXiv Detail & Related papers (2025-09-29T03:24:48Z) - Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess [54.5355907369231]
We investigate whether large language models (LLMs) can develop strategic reasoning capabilities through reinforcement learning (RL) in chess. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models' internal understanding of chess.
arXiv Detail & Related papers (2025-07-01T13:16:34Z) - On the Thinking-Language Modeling Gap in Large Language Models [68.83670974539108]
We show that there is a significant gap between the modeling of languages and thoughts. We propose a new prompting technique, termed Language-of-Thoughts (LoT), to demonstrate and alleviate this gap.
arXiv Detail & Related papers (2025-05-19T09:31:52Z) - Predicting Chess Puzzle Difficulty with Transformers [0.0]
We present GlickFormer, a novel transformer-based architecture that predicts chess puzzle difficulty by approximating the Glicko-2 rating system.
The proposed model utilizes a modified ChessFormer backbone for spatial feature extraction and incorporates temporal information via factorized transformer techniques.
Results demonstrate GlickFormer's superior performance compared to the state-of-the-art ChessFormer baseline across multiple metrics.
arXiv Detail & Related papers (2024-10-14T20:39:02Z) - Learning to Play Chess from Textbooks (LEAP): a Corpus for Evaluating Chess Moves based on Sentiment Analysis [4.314956204483074]
This paper examines chess textbooks as a new knowledge source for enabling machines to learn how to play chess.
We developed the LEAP corpus, the first heterogeneous dataset of its kind, combining structured data (chess move notations and board states) with unstructured data.
We performed empirical experiments that assess the performance of various transformer-based baseline models for sentiment analysis.
arXiv Detail & Related papers (2023-10-31T08:26:02Z) - All by Myself: Learning Individualized Competitive Behaviour with a Contrastive Reinforcement Learning optimization [57.615269148301515]
In a competitive game scenario, a set of agents have to learn decisions that maximize their goals and minimize their adversaries' goals at the same time.
We propose a novel model composed of three neural layers that learn a representation of a competitive game, learn how to map the strategy of specific opponents, and how to disrupt them.
Our experiments demonstrate that our model achieves better performance when playing against offline, online, and competitive-specific models, in particular when playing against the same opponent multiple times.
arXiv Detail & Related papers (2023-10-02T08:11:07Z) - Large Language Models on the Chessboard: A Study on ChatGPT's Formal Language Comprehension and Complex Reasoning Skills [4.138999291282392]
This paper probes the performance of ChatGPT, a sophisticated language model by OpenAI.
We assess ChatGPT's understanding of the chessboard, adherence to chess rules, and strategic decision-making abilities.
Our study also reveals ChatGPT's propensity for a coherent strategy in its gameplay and a noticeable uptick in decision-making assertiveness.
arXiv Detail & Related papers (2023-08-29T08:36:30Z) - ChessGPT: Bridging Policy Learning and Language Modeling [17.85415939196955]
ChessGPT is a GPT model bridging policy learning and language modeling.
We build a large-scale game and language dataset related to chess.
We showcase two example models, ChessCLIP and ChessGPT, which integrate policy learning and language modeling.
arXiv Detail & Related papers (2023-06-15T15:35:31Z) - Improving Chess Commentaries by Combining Language Models with Symbolic Reasoning Engines [31.87260568733666]
We show how to combine symbolic reasoning engines with controllable language models to generate chess commentaries.
We conduct experiments to demonstrate that our approach generates commentaries preferred by human judges over previous baselines.
arXiv Detail & Related papers (2022-12-15T23:38:31Z) - Language Models are Multilingual Chain-of-Thought Reasoners [83.37148309771378]
We introduce the Multilingual Grade School Math benchmark, by manually translating 250 grade-school math problems into ten typologically diverse languages.
We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale.
We show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment.
arXiv Detail & Related papers (2022-10-06T17:03:34Z) - Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
arXiv Detail & Related papers (2022-02-21T18:32:24Z) - Learning Chess Blindfolded: Evaluating Language Models on State Tracking [69.3794549747725]
We consider the task of language modeling for the game of chess.
Unlike natural language, chess notations describe a simple, constrained, and deterministic domain.
We find that transformer language models can learn to track pieces and predict legal moves with high accuracy when trained solely on move sequences.
arXiv Detail & Related papers (2021-02-26T01:16:23Z) - L2E: Learning to Exploit Your Opponent [66.66334543946672]
We propose a novel Learning to Exploit framework for implicit opponent modeling.
L2E acquires the ability to exploit opponents by a few interactions with different opponents during training.
We propose a novel opponent strategy generation algorithm that produces effective opponents for training automatically.
arXiv Detail & Related papers (2021-02-18T14:27:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.