KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?
- URL: http://arxiv.org/abs/2601.08292v1
- Date: Tue, 13 Jan 2026 07:32:50 GMT
- Title: KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?
- Authors: Xianfeng Wang, Kaiwei Zhang, Qi Jia, Zijian Chen, Guangtao Zhai, Xiongkuo Min
- Abstract summary: We introduce KidVis, a novel benchmark grounded in the theory of human visual development. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
- Score: 79.27736230305516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive proficiency in high-level reasoning tasks such as complex diagrammatic interpretation, it remains an open question whether they possess fundamental visual primitives comparable to human intuition. To investigate this, we introduce KidVis, a novel benchmark grounded in the theory of human visual development. KidVis deconstructs visual intelligence into six atomic capabilities already possessed by 6-7-year-old children (Concentration, Tracking, Discrimination, Memory, Spatial, and Closure), comprising 10 categories of visual tasks with low semantic dependence. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity: human children achieve a near-perfect average score of 95.32, while the state-of-the-art GPT-5 attains only 67.33. Crucially, we observe a "Scaling Law Paradox": simply increasing model parameters fails to yield linear improvements in these foundational visual capabilities. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
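The headline numbers (95.32 for children, 67.33 for GPT-5) are averages over per-capability scores. Below is a minimal sketch of how such an aggregation might be computed. The six capability names come from the abstract, but the record schema, function name, and toy data are hypothetical illustrations, not the authors' released evaluation code.

```python
from collections import defaultdict

# The six atomic capabilities named in the KidVis abstract.
CAPABILITIES = ["Concentration", "Tracking", "Discrimination",
                "Memory", "Spatial", "Closure"]

def aggregate_scores(results):
    """Compute per-capability accuracy (0-100), then the overall mean.

    `results` is a list of dicts like
    {"capability": "Tracking", "correct": True} -- a hypothetical
    record format, not the benchmark's actual schema.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["capability"]] += 1
        hits[r["capability"]] += int(r["correct"])
    per_cap = {c: 100.0 * hits[c] / totals[c]
               for c in CAPABILITIES if totals[c]}
    overall = sum(per_cap.values()) / len(per_cap)
    return per_cap, overall

# Toy usage with fabricated model responses (illustration only):
demo = [{"capability": "Tracking", "correct": True},
        {"capability": "Tracking", "correct": False},
        {"capability": "Closure", "correct": True}]
per_cap, overall = aggregate_scores(demo)
print(per_cap, round(overall, 2))  # {'Tracking': 50.0, 'Closure': 100.0} 75.0
```

Averaging per-capability accuracy rather than pooling all items keeps the six capabilities equally weighted regardless of how many items each category contains.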
Related papers
- MentisOculi: Revealing the Limits of Reasoning with Mental Imagery [63.285794947638614]
We develop MentisOculi, a suite of multi-step reasoning problems amenable to visual solution. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning.
arXiv Detail & Related papers (2026-02-02T18:49:06Z) - BabyVision: Visual Reasoning Beyond Language [60.43605497226761]
We introduce BabyVision, a benchmark designed to assess core visual abilities independent of linguistic knowledge. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1.
arXiv Detail & Related papers (2026-01-10T10:42:44Z) - VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These questions assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - Visual Language Models show widespread visual deficits on neuropsychological tests [0.0]
We use the toolkit of neuropsychology to assess the capabilities of three state-of-the-art Visual Language Models (VLMs). We find widespread deficits in low- and mid-level visual abilities that would be considered clinically significant in humans. These selective deficits, profiled through validated test batteries, suggest that an artificial system can achieve complex object recognition without developing foundational visual concepts that in humans require no explicit training.
arXiv Detail & Related papers (2025-04-15T01:04:56Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - Core Knowledge Deficits in Multi-Modal Language Models [41.422258645731276]
Multi-modal Large Language Models (MLLMs) demonstrate impressive abilities in high-level perception and reasoning. But their robustness in the wild remains limited, often falling short on tasks that are intuitive and effortless for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge: rudimentary cognitive abilities innate to humans from early childhood.
arXiv Detail & Related papers (2024-10-06T20:13:11Z) - KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models [43.86823330035457]
This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. We propose a new benchmark of 4,300 visual transformations of everyday objects to test LMMs on visual analogical reasoning.
arXiv Detail & Related papers (2024-07-25T05:02:39Z) - What is the Visual Cognition Gap between Humans and Multimodal LLMs? [63.81347276258992]
We evaluate the visual cognition capability of Multimodal Large Language Models (MLLMs) and compare their performance with human visual cognition studies. Our comparative experiments with different baselines reveal a gap between MLLMs and human intelligence. We believe that the public release of MaRs-VQA and the Qwen2-VCog baseline model will drive progress toward the next generation of MLLMs with human-like visual cognition abilities.
arXiv Detail & Related papers (2024-06-14T22:02:21Z) - Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)