A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities
- URL: http://arxiv.org/abs/2603.02540v1
- Date: Tue, 03 Mar 2026 02:54:58 GMT
- Title: A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities
- Authors: Faiz Ghifari Haznitrama, Faeyza Rishad Ardi, Alice Oh
- Abstract summary: Large language models (LLMs) exhibit a unified "general factor" of capability across 10 benchmarks. We introduce the NeuroCognition benchmark, grounded in three adapted neuropsychological tests. Our evaluation reveals that while models perform strongly on text, their performance degrades for images and with increased complexity.
- Score: 23.297279975389188
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) exhibit a unified "general factor" of capability across 10 benchmarks, a finding confirmed by our factor analysis of 156 models, yet they still struggle with tasks that are simple, even trivial, for humans. This is because current benchmarks focus on task completion, failing to probe the foundational cognitive abilities that underlie these behaviors. We address this by introducing the NeuroCognition benchmark, grounded in three adapted neuropsychological tests: Raven's Progressive Matrices (abstract relational reasoning), Spatial Working Memory (maintenance and systematic search), and the Wisconsin Card Sorting Test (cognitive flexibility). Our evaluation reveals that while models perform strongly on text, their performance degrades for images and with increased complexity. Furthermore, we observe that complex reasoning is not universally beneficial, whereas simple, human-like strategies yield partial gains. We also find that NeuroCognition correlates positively with standard general-capability benchmarks, while still measuring distinct cognitive abilities beyond them. Overall, NeuroCognition emphasizes where current LLMs align with human-like intelligence and where they lack core adaptive cognition, showing the potential to serve as a verifiable, scalable source for improving LLMs.
Related papers
- Metacognitive Sensitivity for Test-Time Dynamic Model Selection [0.0]
We propose a new framework for evaluating and leveraging AI metacognition. We introduce meta-d', a psychologically-grounded measure of metacognitive sensitivity, to characterise how reliably a model's confidence predicts its own accuracy. We then use this dynamic sensitivity score as context for a bandit-based arbiter that performs test-time model selection.
arXiv Detail & Related papers (2025-12-11T09:15:05Z)
- Cognitive Foundations for Reasoning and Their Manifestation in LLMs [63.12951576410617]
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations. We develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems.
arXiv Detail & Related papers (2025-11-20T18:59:00Z)
- Think Socially via Cognitive Reasoning [94.60442643943696]
We introduce Cognitive Reasoning, a paradigm modeled on human social cognition. CogFlow is a complete framework that instills this capability in LLMs.
arXiv Detail & Related papers (2025-09-26T16:27:29Z)
- 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis [54.24689751375923]
This work introduces a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs. Through experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. These findings highlight both emerging capabilities and limitations in current MLLMs' spatial reasoning capabilities.
arXiv Detail & Related papers (2025-08-27T17:22:34Z)
- Visual Large Language Models Exhibit Human-Level Cognitive Flexibility in the Wisconsin Card Sorting Test [5.346677002840565]
This study assesses the cognitive flexibility of state-of-the-art Visual Large Language Models (VLLMs). Our results reveal that VLLMs achieve or surpass human-level set-shifting capabilities under chain-of-thought prompting with text-based inputs.
arXiv Detail & Related papers (2025-05-28T08:40:55Z)
- Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations [2.759846687681801]
Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize those strategies that govern their behavior. This suggests a limited degree of metacognition - the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. We introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify metacognitive abilities of LLMs to report and control their activation patterns.
arXiv Detail & Related papers (2025-05-19T22:32:25Z)
- Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
- Kolb-Based Experiential Learning for Generalist Agents with Human-Level Kaggle Data Science Performance [81.05882480184587]
We propose a computational framework of Kolb's learning cycle with Vygotsky's ZPD for autonomous agents. Agent K is the 1st AI system to successfully integrate Kolb- and Vygotsky-inspired human cognitive learning. With 9 gold, 8 silver, and 12 bronze medals level performance - including 4 gold and 4 silver on prize-awarding competitions - Agent K is the 1st AI system to successfully integrate Kolb- and Vygotsky-inspired human cognitive learning.
arXiv Detail & Related papers (2024-11-05T23:55:23Z)
- A Novel Supervised Contrastive Regression Framework for Prediction of Neurocognitive Measures Using Multi-Site Harmonized Diffusion MRI Tractography [13.80649748804573]
Supervised Contrastive Regression (SCR) is a simple yet effective method that allows full supervision for contrastive learning in regression tasks.
SCR performs supervised contrastive representation learning by using the absolute difference between continuous regression labels.
SCR improves the accuracy of neurocognitive score prediction compared to other state-of-the-art methods.
arXiv Detail & Related papers (2022-10-13T23:24:12Z)
- Modeling cognitive load as a self-supervised brain rate with electroencephalography and deep learning [2.741266294612776]
This research presents a novel self-supervised method for mental workload modelling from EEG data.
The method is a convolutional recurrent neural network trainable with spatially preserving spectral topographic head-maps from EEG data to fit the brain rate variable.
Findings point to the existence of quasi-stable blocks of learnt high-level representations of cognitive activation because they can be induced through convolution and seem not to be dependent on each other over time, intuitively matching the non-stationary nature of brain responses.
arXiv Detail & Related papers (2022-09-21T07:44:21Z)
- AGENT: A Benchmark for Core Psychological Reasoning [60.35621718321559]
Intuitive psychology is the ability to reason about hidden mental variables that drive observable actions.
Despite recent interest in machine agents that reason about other agents, it is not clear if such agents learn or hold the core psychology principles that drive human reasoning.
We present a benchmark consisting of procedurally generated 3D animations, AGENT, structured around four scenarios.
arXiv Detail & Related papers (2021-02-24T14:58:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.