Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess
- URL: http://arxiv.org/abs/2601.16823v1
- Date: Fri, 23 Jan 2026 15:23:08 GMT
- Title: Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess
- Authors: Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsäcker
- Abstract summary: Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall (crystallized intelligence) or reasoning ability (fluid intelligence). We introduce chess as a controlled testbed for disentangling these faculties. We construct a taxonomy of positions varying in training corpus proximity, ranging from common states solvable by memorization to novel ones requiring first-principles reasoning.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall (crystallized intelligence) or reasoning ability (fluid intelligence). We introduce chess as a controlled testbed for disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in training corpus proximity--ranging from common states solvable by memorization to novel ones requiring first-principles reasoning. We systematically evaluate multiple GPT generations under varying reasoning intensities. Our analysis reveals a clear gradient: performance consistently degrades as fluid intelligence demands increase. Notably, in out-of-distribution tasks, performance collapses to random levels. While newer models improve, progress slows significantly for tasks outside the training distribution. Furthermore, while reasoning-augmented inference improves performance, its marginal benefit per token decreases with distributional proximity. These results suggest current architectures remain limited in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust fluid intelligence.
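As a rough illustration of the kind of engine-based scoring such an evaluation relies on, the sketch below computes the centipawn loss of a model's chosen move against Stockfish's preferred line (a minimal sketch using python-chess and a local Stockfish binary; the FEN, move, search depth, and the idea of a query_llm_move helper are illustrative assumptions, not the authors' exact protocol):

```python
# Minimal sketch: score an LLM's move by its centipawn loss relative to the
# engine's best move. Assumptions: python-chess installed and a Stockfish
# binary on PATH; query_llm_move() is a hypothetical helper, stubbed below.
import chess
import chess.engine

def centipawn_loss(fen: str, llm_move_uci: str, engine, depth: int = 18) -> int:
    board = chess.Board(fen)
    info = engine.analyse(board, chess.engine.Limit(depth=depth))
    best_cp = info["score"].relative.score(mate_score=100_000)
    board.push(chess.Move.from_uci(llm_move_uci))
    info = engine.analyse(board, chess.engine.Limit(depth=depth))
    # After the move, the score is from the opponent's perspective, so negate.
    llm_cp = -info["score"].relative.score(mate_score=100_000)
    return best_cp - llm_cp  # 0 = engine-optimal; larger = worse move

with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    fen = chess.STARTING_FEN   # stand-in for a position from one taxonomy bucket
    llm_move = "e2e4"          # stand-in for a hypothetical query_llm_move(fen)
    print(centipawn_loss(fen, llm_move, engine))
```

Positions drawn from different taxonomy buckets (common opening theory, rarer middlegames, synthetic novel states) could then be fed through the same loop to compare loss as a function of distributional proximity.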
Related papers
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs [100.02824137397464]
We investigate how Large Language Models adapt their internal representations when encountering inputs of increasing difficulty. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, the last hidden states of LLMs become substantially sparser. This sparsity-difficulty relation is observable across diverse models and domains.
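One way such a measurement could be set up is sketched below (a minimal sketch; the model, the near-zero threshold, and the example prompts are illustrative assumptions rather than the paper's protocol):

```python
# Minimal sketch: fraction of near-zero activations in the final hidden state.
# Assumptions: gpt2 as a stand-in model and 1e-2 as the sparsity threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def last_hidden_sparsity(model, tokenizer, text, threshold=1e-2):
    """Return the fraction of last-hidden-state entries below the threshold."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    h = out.hidden_states[-1]  # shape: (1, seq_len, hidden_dim)
    return (h.abs() < threshold).float().mean().item()

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
easy = "2 + 2 ="                                            # easier prompt
hard = "Prove that every group of prime order is cyclic:"   # harder prompt
print(last_hidden_sparsity(lm, tok, easy), last_hidden_sparsity(lm, tok, hard))
```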
arXiv Detail & Related papers (2026-03-03T18:48:15Z)
- Fluid Representations in Reasoning Models [91.77876704697779]
We present a mechanistic analysis of how QwQ-32B processes abstract structural information. We find that QwQ-32B gradually improves its internal representation of actions and concepts during reasoning.
arXiv Detail & Related papers (2026-02-04T18:34:50Z)
- How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z)
- CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding [13.628041236679229]
Vision Language Models (VLMs) have recently shown significant advancements in video understanding, but their capability for counterfactual reasoning remains underexplored. We introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. We develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality.
arXiv Detail & Related papers (2025-11-25T04:59:55Z)
- AI Agents as Universal Task Solvers [94.49762121230042]
We show that the optimal speed-up that a universal solver can achieve using past data is tightly related to its algorithmic information. We argue that the key quantity to optimize when scaling reasoning models is time, whose critical role in learning has so far only been indirectly considered.
arXiv Detail & Related papers (2025-10-14T02:17:54Z)
- Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
- Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling [60.63703438729223]
We show how different architectures and training methods affect a model's multi-step reasoning capabilities. We confirm that increasing model depth plays a crucial role for sequential computations.
arXiv Detail & Related papers (2025-08-22T18:57:08Z)
- Superficial Self-Improved Reasoners Benefit from Model Merging [49.09091498084467]
Self-improvement has emerged as a solution for synthesizing high-quality data corpora. However, our analysis reveals that even when LMs show improved in-domain (ID) reasoning accuracy, they compromise their generalized reasoning capabilities. We propose Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization.
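In weight space, a merge of this kind reduces to interpolating matching parameters; the sketch below shows a single linear merge (a minimal sketch; the coefficient alpha and the one-shot merge are illustrative assumptions, since IMM's iterative procedure is not detailed in this summary):

```python
# Minimal sketch of weight-space model merging. Assumption: plain linear
# interpolation between two checkpoints; IMM's actual iterative strategy
# may differ.
import torch

def merge_state_dicts(original, self_improved, alpha=0.5):
    """Linearly interpolate two compatible state dicts, parameter by parameter."""
    assert original.keys() == self_improved.keys()
    return {
        name: (1 - alpha) * original[name] + alpha * self_improved[name]
        for name in original
    }

# Usage with any two checkpoints of the same architecture:
# base = torch.load("original.pt"); tuned = torch.load("self_improved.pt")
# model.load_state_dict(merge_state_dicts(base, tuned, alpha=0.5))
```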
arXiv Detail & Related papers (2025-03-03T22:41:25Z)
- Emergent Abilities in Large Language Models: A Survey [9.50669909278749]
Large Language Models (LLMs) are leading a new technological revolution as one of the most promising research streams toward artificial general intelligence. The scaling of these models has been linked to various so-called emergent abilities that were previously unobserved. These abilities range from advanced reasoning and in-context learning to coding and problem-solving. Despite their transformative potential, emergent abilities remain poorly understood, leading to misconceptions about their definition, nature, predictability, and implications.
arXiv Detail & Related papers (2025-02-28T01:20:01Z)
- Improving Network Interpretability via Explanation Consistency Evaluation [56.14036428778861]
We propose a framework that acquires more explainable activation heatmaps and simultaneously increases model performance.
Specifically, our framework introduces a new metric, explanation consistency, to adaptively reweight the training samples during model learning.
Our framework then promotes model learning by paying closer attention to those training samples with a high difference in explanations.
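A reweighting loop in this spirit might look as follows (a minimal sketch; using gradient saliency as the explanation and cosine similarity as the consistency metric are assumptions, since the summary specifies neither):

```python
# Minimal sketch of consistency-based sample reweighting. Assumptions:
# gradient saliency as the "explanation" and cosine similarity under input
# perturbation as the consistency metric; the paper's choices may differ.
import torch
import torch.nn.functional as F

def saliency(model, x, y):
    """Gradient of the loss w.r.t. the input, as a crude explanation map."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(loss, x)
    return grad.flatten(1)

def consistency_weights(model, x, y, noise=0.05):
    """Upweight samples whose explanations change most under perturbation."""
    s1 = saliency(model, x, y)
    s2 = saliency(model, x + noise * torch.randn_like(x), y)
    consistency = F.cosine_similarity(s1, s2, dim=1)   # in [-1, 1]
    return (1.0 - consistency).detach()                # high diff -> high weight

def weighted_step(model, optimizer, x, y):
    w = consistency_weights(model, x, y)
    loss = (w * F.cross_entropy(model(x), y, reduction="none")).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```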
arXiv Detail & Related papers (2024-08-08T17:20:08Z)
- Essentials for Class Incremental Learning [43.306374557919646]
Our class-incremental learning results on CIFAR-100 and ImageNet improve over the state of the art by a large margin, while keeping the approach simple.
arXiv Detail & Related papers (2021-02-18T18:01:06Z)