The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?
- URL: http://arxiv.org/abs/2508.21143v2
- Date: Wed, 08 Oct 2025 07:49:55 GMT
- Title: The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?
- Authors: Samrajnee Ghosh, Naman Agarwal, Hemanshu Garg, Chinmay Mittal, Mausam, Parag Singla
- Abstract summary: We introduce Percept-V, a dataset containing 6000 program-generated uncontaminated images divided into 30 domains. Our focus is on perception, so we make our domains quite simple and the reasoning and knowledge required for solving them are minimal. Contrary to our belief, our experiments show a weak performance of SoTA proprietary and open-source MLLMs compared to very high human performance on Percept-V.
- Score: 23.22049250636057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cognitive science research treats visual perception, the ability to understand and make sense of a visual input, as one of the early developmental signs of intelligence. Its TVPS-4 framework categorizes and tests human perception across seven skills, such as visual discrimination and form constancy. Do Multimodal Large Language Models (MLLMs) match up to humans in basic perception? Even though many benchmarks evaluate MLLMs on advanced reasoning and knowledge skills, there is limited research that focuses evaluation on simple perception. In response, we introduce Percept-V, a dataset containing 6000 program-generated uncontaminated images divided into 30 domains, where each domain tests one or more TVPS-4 skills. Our focus is on perception, so we keep our domains quite simple, and the reasoning and knowledge required to solve them are minimal. Since modern-day MLLMs can solve much more complex tasks, our a-priori expectation is that they will solve these domains very easily. Contrary to this belief, our experiments show weak performance of SoTA proprietary and open-source MLLMs compared to very high human performance on Percept-V. We find that as the number of objects in the image increases, performance drops rapidly. Our experiments also identify the perception skills that are considerably harder for all models.
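To make concrete what a program-generated, uncontaminated perception item could look like, here is a minimal Python sketch of a counting-style domain. It is purely illustrative: the shapes, colors, question wording, and use of Pillow are assumptions of this sketch, not the authors' actual Percept-V generator.

```python
# Hypothetical sketch of a program-generated perception item
# (illustrative only; not the Percept-V generator).
import json
import random
from PIL import Image, ImageDraw

def generate_counting_item(n_objects: int, size: int = 512, seed: int = 0):
    """Draw n_objects non-overlapping colored shapes and return (image, QA dict)."""
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    shapes, colors = ["circle", "square"], ["red", "blue", "green"]
    placed, counts = [], {}
    attempts = 0
    while len(placed) < n_objects and attempts < 10_000:
        attempts += 1
        r = rng.randint(15, 30)
        x, y = rng.randint(r, size - r), rng.randint(r, size - r)
        # Reject placements that overlap an already drawn shape.
        if any((x - px) ** 2 + (y - py) ** 2 < (r + pr + 5) ** 2 for px, py, pr in placed):
            continue
        shape, color = rng.choice(shapes), rng.choice(colors)
        box = (x - r, y - r, x + r, y + r)
        (draw.ellipse if shape == "circle" else draw.rectangle)(box, fill=color)
        placed.append((x, y, r))
        counts[(shape, color)] = counts.get((shape, color), 0) + 1
    # The ground-truth answer is computed by the same program that renders the image.
    shape, color = rng.choice(list(counts))
    qa = {
        "question": f"How many {color} {shape}s are in the image?",
        "answer": counts[(shape, color)],
    }
    return img, qa

if __name__ == "__main__":
    image, qa = generate_counting_item(n_objects=8, seed=42)
    image.save("perception_item.png")
    print(json.dumps(qa))
```

Because the answer is computed by the same program that renders the image, items like this are free of web contamination and can be scaled to arbitrary object counts, the axis along which the paper reports accuracy dropping fastest.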
Related papers
- Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities. In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench. We will open-source the CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z) - Visual Room 2.0: Seeing is Not Understanding for MLLMs [9.870930749379932]
We introduce Visual Room 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception to cognition.
arXiv Detail & Related papers (2025-11-17T03:34:52Z) - Artificial Phantasia: Evidence for Propositional Reasoning-Based Mental Imagery in Large Language Models [0.0]
This study offers a novel approach for benchmarking complex cognitive behavior in artificial systems. We created dozens of novel items of a classic mental imagery task from cognitive psychology. We found that the best LLMs performed significantly above average human performance.
arXiv Detail & Related papers (2025-09-27T04:36:12Z) - Pixels, Patterns, but No Poetry: To See The World like Humans [33.773551676022514]
State-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans. This paper shifts focus from reasoning to perception.
arXiv Detail & Related papers (2025-07-21T21:50:16Z) - Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs [9.951669153984708]
"Do You See Me" is a scalable benchmark with 1,758 images and 2,612 questions.<n>Humans achieve 96.49% accuracy, while top MLLMs average below 50%.<n>This underscores an urgent need for MLLMs with truly robust visual perception.
arXiv Detail & Related papers (2025-05-28T13:31:32Z) - Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT). GCoT helps MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT), consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark setting to simulate human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z) - What is the Visual Cognition Gap between Humans and Multimodal LLMs? [63.81347276258992]
We evaluate the visual cognition capability of Multimodal Large Language Models (MLLMs) and compare their performance with human visual cognition studies. Our comparative experiments with different baselines reveal a gap between MLLMs and human intelligence. We believe that the public release of MaRs-VQA and the Qwen2-VCog baseline model will drive progress toward the next generation of MLLMs with human-like visual cognition abilities.
arXiv Detail & Related papers (2024-06-14T22:02:21Z) - TopViewRS: Vision-Language Models as Top-View Spatial Reasoners [38.406430696146714]
Top-view perspective denotes a typical way in which humans read and reason over different types of maps.
We introduce the TopViewRS dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view map as visual input.
We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity.
arXiv Detail & Related papers (2024-06-04T17:55:43Z) - MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning [22.440669015518015]
We evaluate whether multi-modal large language models (MLLMs) possess abstract visual reasoning abilities.
Similar to Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns.
We introduce MARVEL, a benchmark of 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations.
arXiv Detail & Related papers (2024-04-21T09:15:02Z) - Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [71.93366651585275]
Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks.
We propose Visualization-of-Thought (VoT) to elicit spatial reasoning of LLMs by visualizing their reasoning traces.
VoT significantly enhances the spatial reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-04-04T17:45:08Z) - Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z) - Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z) - TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z)
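The TouchStone entry above evaluates LVLMs by letting a strong text-only LLM judge dialogue quality. A minimal sketch of that judging loop is given below, assuming an OpenAI-compatible chat API; the model name, rubric, prompt wording, and 1-10 scale are illustrative assumptions of this sketch, not the TouchStone protocol.

```python
# Minimal LLM-as-judge sketch (illustrative only; not the TouchStone protocol).
# Assumes an OpenAI-compatible chat endpoint; model name and rubric are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a vision-language assistant.
Image description (ground truth): {description}
Question: {question}
Assistant answer: {answer}
Rate the answer's correctness and helpfulness on a 1-10 scale.
Reply with the number only."""

def judge(description: str, question: str, answer: str, model: str = "gpt-4o") -> int:
    """Ask a text-only judge model to score an answer against a ground-truth description."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            description=description, question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# Example usage (hypothetical data):
# score = judge("Two red circles and one blue square on a white background.",
#               "How many red circles are there?", "There are two red circles.")
```

The design point carried over from the abstract is that the judge never sees the image: it scores the assistant's answer against a textual description standing in for the image, which is why a text-only model can do the grading.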
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.