The Catastrophic Paradox of Human Cognitive Frameworks in Large Language Model Evaluation: A Comprehensive Empirical Analysis of the CHC-LLM Incompatibility
- URL: http://arxiv.org/abs/2511.18302v1
- Date: Sun, 23 Nov 2025 05:49:57 GMT
- Title: The Catastrophic Paradox of Human Cognitive Frameworks in Large Language Model Evaluation: A Comprehensive Empirical Analysis of the CHC-LLM Incompatibility
- Authors: Mohan Reddy
- Abstract summary: Models achieving above-average human IQ scores simultaneously exhibit binary accuracy rates approaching zero on crystallized knowledge tasks. This disconnect appears most strongly in the crystallized intelligence domain. We propose a framework for developing native machine cognition assessments that recognize the non-human nature of artificial intelligence.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This investigation presents an empirical analysis of the incompatibility between human psychometric frameworks and Large Language Model evaluation. Through systematic assessment of nine frontier models including GPT-5, Claude Opus 4.1, and Gemini 3 Pro Preview using the Cattell-Horn-Carroll theory of intelligence, we identify a paradox that challenges the foundations of cross-substrate cognitive evaluation. Our results show that models achieving above-average human IQ scores ranging from 85.0 to 121.4 simultaneously exhibit binary accuracy rates approaching zero on crystallized knowledge tasks, with an overall judge-binary correlation of r = 0.175 (p = 0.001, n = 1800). This disconnect appears most strongly in the crystallized intelligence domain, where every evaluated model achieved perfect binary accuracy while judge scores ranged from 25 to 62 percent, which cannot occur under valid measurement conditions. Using statistical analyses including Item Response Theory modeling, cross-vendor judge validation, and paradox severity indexing, we argue that this disconnect reflects a category error in applying biological cognitive architectures to transformer-based systems. The implications extend beyond methodology to challenge assumptions about intelligence, measurement, and anthropomorphic biases in AI evaluation. We propose a framework for developing native machine cognition assessments that recognize the non-human nature of artificial intelligence.
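The abstract's headline statistic is a weak judge-binary correlation of r = 0.175 (p = 0.001, n = 1800) between continuous judge scores and binary pass/fail accuracy. The sketch below shows how such an item-level Pearson correlation is computed; the data are synthetic placeholders (random scores loosely in the paper's reported 25-62 judge-score range), not the paper's dataset, and the helper name `pearson_r` is illustrative.

```python
# Minimal sketch: Pearson correlation between continuous judge scores
# and binary correctness flags across evaluation items. Synthetic data
# only -- the paper reports r = 0.175 (p = 0.001, n = 1800) on its own
# item-level results.
import random
from statistics import mean

random.seed(0)

# Synthetic item-level results: judge scores on a 0-100 scale and
# binary correctness flags only weakly related to them.
judge_scores = [random.uniform(25, 62) for _ in range(1800)]
binary_acc = [1 if random.random() < 0.5 + (s - 43.5) / 500 else 0
              for s in judge_scores]

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson_r(judge_scores, binary_acc)
print(f"judge-binary correlation on synthetic data: r = {r:.3f}")
```

A correlation this far below 1.0 between two measures of the same construct is what the authors flag as a validity failure: under sound measurement, judge scores and binary accuracy should largely agree.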
Related papers
- The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness
This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF) that reverse-engineers unstructured text into structured cognitive vectors. Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain.
arXiv Detail & Related papers (2025-12-01T07:09:38Z) - Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity
We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching. We reveal a striking disconnect between surface performance and reasoning fidelity. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics.
arXiv Detail & Related papers (2025-11-29T16:47:01Z) - Cognitive Foundations for Reasoning and Their Manifestation in LLMs
We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning computational constraints, meta-cognitive controls, knowledge representations, and transformation operations. We conduct the first large-scale analysis of 170K traces from 17 models across text, vision, and audio modalities, alongside 54 human think-aloud traces.
arXiv Detail & Related papers (2025-11-20T18:59:00Z) - AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms
We investigate whether large language models can achieve sophisticated norm understanding through statistical learning alone. Across two studies, we evaluate multiple AI systems' ability to predict human social appropriateness judgments. Despite this predictive power, all models showed systematic, correlated errors.
arXiv Detail & Related papers (2025-08-26T13:03:56Z) - Measuring How LLMs Internalize Human Psychological Concepts: A preliminary analysis
We develop a framework to assess concept alignment between Large Language Models and human psychological dimensions. A GPT-4 model achieved superior classification accuracy (66.2%), significantly outperforming GPT-3.5 (55.9%) and BERT (48.1%). Our findings demonstrate that modern LLMs can approximate human psychological constructs with measurable accuracy.
arXiv Detail & Related papers (2025-06-29T01:56:56Z) - MetaQAP - A Meta-Learning Approach for Quality-Aware Pretraining in Image Quality Assessment
Image Quality Assessment (IQA) is a critical task in a wide range of applications but remains challenging due to the subjective nature of human perception and the complexity of real-world image distortions. This study proposes MetaQAP, a novel no-reference IQA model designed to address these challenges by leveraging quality-aware pre-training and meta-learning. The proposed MetaQAP model achieved exceptional performance with Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank Order Correlation Coefficient (SROCC) scores of 0.9885/0.9812 on LiveCD, 0.9702/0.9658 on Kon
arXiv Detail & Related papers (2025-06-19T21:03:47Z) - Learning to Generate and Evaluate Fact-checking Explanations with Transformers
This research contributes to the field of Explainable Artificial Intelligence (XAI).
We develop transformer-based fact-checking models that contextualise and justify their decisions by generating human-accessible explanations.
We emphasise the need for aligning Artificial Intelligence (AI)-generated explanations with human judgements.
arXiv Detail & Related papers (2024-10-21T06:22:51Z) - Position: AI Evaluation Should Learn from How We Test Humans
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Perceptual Attacks of No-Reference Image Quality Models with Human-in-the-Loop
We make one of the first attempts to examine the perceptual robustness of NR-IQA models.
We test one knowledge-driven and three data-driven NR-IQA methods under four full-reference IQA models.
We find that all four NR-IQA models are vulnerable to the proposed perceptual attack.
arXiv Detail & Related papers (2022-10-03T13:47:16Z) - Neural Causal Models for Counterfactual Identification and Estimation
We study the evaluation of counterfactual statements through neural models.
First, we show that neural causal models (NCMs) are expressive enough.
Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions.
arXiv Detail & Related papers (2022-09-30T18:29:09Z) - Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence
We look at the standardization gap and the validation gap in topic model evaluation.
Recent models relying on neural components surpass classical topic models according to these metrics.
We use automatic coherence along with the two most widely accepted human judgment tasks, namely, topic rating and word intrusion.
arXiv Detail & Related papers (2021-07-05T17:58:52Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.