Related papers: The Artificial Intelligence Cognitive Examination: A Survey on the Evolution of Multimodal Evaluation from Recognition to Reasoning

URL: http://arxiv.org/abs/2510.04141v1
Date: Sun, 05 Oct 2025 10:41:22 GMT
Title: The Artificial Intelligence Cognitive Examination: A Survey on the Evolution of Multimodal Evaluation from Recognition to Reasoning
Authors: Mayank Ravishankara, Varindra V. Persad Maharaj
Abstract summary: We argue that the field is undergoing a paradigm shift, moving from simple recognition tasks to complex reasoning benchmarks. We chart the journey from the foundational "knowledge tests" of the ImageNet era to the "applied logic and comprehension" exams. We explore the uncharted territories of evaluating abstract, creative, and social intelligence.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This survey paper chronicles the evolution of evaluation in multimodal artificial intelligence (AI), framing it as a progression of increasingly sophisticated "cognitive examinations." We argue that the field is undergoing a paradigm shift, moving from simple recognition tasks that test "what" a model sees, to complex reasoning benchmarks that probe "why" and "how" it understands. This evolution is driven by the saturation of older benchmarks, where high performance often masks fundamental weaknesses. We chart the journey from the foundational "knowledge tests" of the ImageNet era to the "applied logic and comprehension" exams such as GQA and Visual Commonsense Reasoning (VCR), which were designed specifically to diagnose systemic flaws such as shortcut learning and failures in compositional generalization. We then survey the current frontier of "expert-level integration" benchmarks (e.g., MMBench, SEED-Bench, MMMU) designed for today's powerful multimodal large language models (MLLMs), which increasingly evaluate the reasoning process itself. Finally, we explore the uncharted territories of evaluating abstract, creative, and social intelligence. We conclude that the narrative of AI evaluation is not merely a history of datasets, but a continuous, adversarial process of designing better examinations that, in turn, redefine our goals for creating truly intelligent systems.

Related papers

Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models [36.1675867877378]
We propose Self-Anchored Knowledge Integration (SAKE), a training-free inference-time strategy designed to stabilize knowledge utilization. SAKE significantly mitigates Knowledge Integration Decay (KID) and improves performance, offering a lightweight yet effective solution for knowledge integration in agentic LLMs.
arXiv Detail & Related papers (2026-02-10T08:20:26Z)
On the Measure of a Model: From Intelligence to Generality [0.7561750463371523]
Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet the concept of intelligence remains elusive, lacking a stable definition and failing to predict performance on practical tasks such as question answering, summarization, or coding. Our perspective is that evaluation should be grounded in generality rather than abstract notions of intelligence.
arXiv Detail & Related papers (2025-11-14T09:46:48Z)
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models [15.929002709503921]
We aim to evaluate a fundamental yet underexplored intelligence: association. MM-OPERA is a systematic benchmark with 11,497 instances across two open-ended tasks. It challenges LVLMs to resemble the spirit of divergent thinking and convergent associative reasoning.
arXiv Detail & Related papers (2025-10-30T18:49:06Z)
Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z)
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models [59.85951092642609]
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world. They often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This survey introduces a novel and unified analytical framework: "From Perception to Cognition."
arXiv Detail & Related papers (2025-09-29T18:25:40Z)
Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training [86.70255651945602]
We introduce a novel inference-time steering methodology called Reinforcing Cognitive Experts (RICE). RICE aims to improve reasoning performance without additional training or complex heuristics. Empirical evaluations with leading MoE-based LRMs demonstrate noticeable and consistent improvements in reasoning accuracy, cognitive efficiency, and cross-domain generalization.
arXiv Detail & Related papers (2025-05-20T17:59:16Z)
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence [22.086567828557683]
Social Intelligence Queries (Social-IQ) serve as the primary multimodal benchmark for evaluating a model's social intelligence level. We propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model.
arXiv Detail & Related papers (2025-04-03T02:48:21Z)
How Metacognitive Architectures Remember Their Own Thoughts: A Systematic Review [16.35521789216079]
Metacognition has gained significant attention for its potential to enhance autonomy and adaptability of artificial agents. Existing overviews remain at a conceptual level that is undiscerning to the underlying algorithms, representations, and their respective success.
arXiv Detail & Related papers (2025-02-28T08:48:41Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge [60.76719375410635]
We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos. The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving. We generate associated question-answer pairs and reasoning processes, finally followed by manual reviews for quality assurance.
arXiv Detail & Related papers (2024-05-15T21:55:31Z)
Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents. We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations. We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)
In-Context Analogical Reasoning with Pre-Trained Language Models [10.344428417489237]
We explore the use of intuitive language-based abstractions to support analogy in AI systems. Specifically, we apply large pre-trained language models (PLMs) to visual Raven's Progressive Matrices (RPM). We find that PLMs exhibit a striking capacity for zero-shot relational reasoning, exceeding human performance and nearing supervised vision-based methods.
arXiv Detail & Related papers (2023-05-28T04:22:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.