VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs
- URL: http://arxiv.org/abs/2602.05382v1
- Date: Thu, 05 Feb 2026 07:07:27 GMT
- Title: VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs
- Authors: Tina Khezresmaeilzadeh, Jike Zhong, Konstantinos Psounis
- Abstract summary: We introduce VRIQ, a novel benchmark designed to assess and analyze the visual reasoning ability of Vision Language Models (VLMs). We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning alone.
- Score: 3.8552182839941884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in Vision Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a novel benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style and natural-image reasoning tasks. We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. We also find that tool-augmented reasoning demonstrates only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning alone. This motivates us to design fine-grained diagnostic probe questions targeting specific perception categories (e.g., shape, count, position, 3D/depth), revealing that certain categories cause more failures than others. Our benchmark and analysis establish that current VLMs, even with visual reasoning tools, remain unreliable abstract reasoners, mostly due to perception limitations, and offer a principled basis for improving visual reasoning in multimodal systems.
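The failure breakdown reported in the abstract (perception alone, both, reasoning alone) can be pictured as a simple tally over probe outcomes. The sketch below is only an illustration under stated assumptions: the function names, the record format (one pair of booleans per failed item), and the toy data are hypothetical and are not taken from the paper, whose actual probing protocol is not described here.

```python
from collections import Counter

def attribute_failure(perception_ok: bool, reasoning_ok: bool) -> str:
    """Label one failed item by which diagnostic probes it also failed.

    Hypothetical scheme: each failed benchmark item is assumed to come with
    a perception-probe result and a reasoning-probe result.
    """
    if not perception_ok and not reasoning_ok:
        return "both"
    if not perception_ok:
        return "perception_only"
    if not reasoning_ok:
        return "reasoning_only"
    return "unexplained"  # both probes passed, yet the item was failed

def failure_breakdown(records):
    """Return the percentage of failures in each attribution category."""
    counts = Counter(attribute_failure(p, r) for p, r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}

# Made-up toy data (NOT results from the paper): 10 failed items.
toy_records = [(False, True)] * 5 + [(False, False)] * 4 + [(True, False)] * 1
breakdown = failure_breakdown(toy_records)
print(breakdown)
```

With real probe results in place of the toy records, the same tally would yield percentages comparable in form to the 56% / 43% / 1% split quoted above.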
Related papers
- BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone. Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z)
- VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs [18.349695067647012]
Visual Language Models excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple tests. We present an evaluation that tests vision-language models' capacity for nonlocal visual reasoning. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.
arXiv Detail & Related papers (2025-07-04T23:15:52Z)
- What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning [26.671128120554457]
Causal reasoning is fundamental to solving complex high-level reasoning tasks. Existing benchmarks often include a mixture of reasoning questions. We introduce VQA-Causal and VCR-Causal to isolate and rigorously evaluate causal reasoning abilities.
arXiv Detail & Related papers (2025-06-01T07:17:46Z)
- Caption This, Reason That: VLMs Caught in the Middle [3.4820139118440676]
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. They still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. We analyze VLM performance along core cognitive axes: Perception, Attention, and Memory.
arXiv Detail & Related papers (2025-05-24T14:25:48Z)
- VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning [56.99825489208698]
We introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks. VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks within a unified model. We evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting.
arXiv Detail & Related papers (2025-05-17T16:51:47Z)
- IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests [1.1142124321313052]
We introduce IQBench, a new benchmark designed to evaluate Vision-Language Models on standardized visual IQ tests. We focus on evaluating the reasoning capabilities of VLMs, which we argue are more important than the accuracy of the final prediction.
arXiv Detail & Related papers (2025-05-17T13:24:08Z)
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
- Forgotten Polygons: Multimodal Large Language Models are Shape-Blind [55.65083505741497]
Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We propose Visually Cued Chain-of-Thought prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams.
arXiv Detail & Related papers (2025-02-21T22:04:09Z)
- ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom [59.92786855289658]
We introduce a novel visual reasoning framework named ProReason. ProReason features decoupled vision-reasoning capabilities and multi-run proactive perception. Our experiments demonstrate that ProReason outperforms existing multi-step reasoning frameworks on various benchmarks.
arXiv Detail & Related papers (2024-10-18T03:22:06Z)
- Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.