VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs
- URL: http://arxiv.org/abs/2507.13361v1
- Date: Fri, 04 Jul 2025 23:15:52 GMT
- Title: VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs
- Authors: Shmuel Berman, Jia Deng
- Abstract summary: Visual Language Models excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation that tests vision-language models' capacity for nonlocal visual reasoning. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.
- Score: 18.349695067647012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation that tests vision-language models' capacity for nonlocal visual reasoning -- reasoning that requires chaining evidence collected from multiple, possibly distant, regions of an image. We isolate three distinct forms of non-local vision: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves searching smoothly along a continuous contour. Flagship models (e.g., Gemini 2.5 Pro, Claude Vision 3.7, GPT-o4-mini), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite allows us to test if VLMs can perform similar visual algorithms to humans. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.
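Below is a minimal sketch of how a structured evaluation of this kind could be scored. It is illustrative only: the `NonlocalTask` record, the `query_vlm(image, question)` wrapper, and the demo prompts and file names are hypothetical stand-ins, not the authors' actual suite or any model's real API. It shows the core idea of grouping multiple-choice tasks by family (comparative, saccadic, smooth search) and reporting accuracy against the random baseline the abstract refers to.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class NonlocalTask:
    """Hypothetical task record: one image, one multiple-choice question."""
    family: str      # "comparative", "saccadic", or "smooth_search"
    image: str       # path to the stimulus image (illustrative)
    question: str    # prompt given to the VLM
    answer: str      # expected choice label
    choices: tuple   # answer options, used for the chance baseline

def evaluate(tasks: list[NonlocalTask],
             query_vlm: Callable[[str, str], str]) -> dict:
    """Score a model per task family and compare against chance accuracy."""
    results: dict[str, dict[str, float]] = {}
    for task in tasks:
        stats = results.setdefault(task.family,
                                   {"correct": 0, "total": 0, "chance": 0.0})
        prediction = query_vlm(task.image, task.question)   # model under test
        stats["correct"] += int(prediction.strip().lower() == task.answer.lower())
        stats["chance"] += 1.0 / len(task.choices)          # expected random score
        stats["total"] += 1
    return {
        family: {
            "accuracy": s["correct"] / s["total"],
            "random_baseline": s["chance"] / s["total"],
        }
        for family, s in results.items()
    }

if __name__ == "__main__":
    # Stub that guesses uniformly at random, standing in for a real VLM call.
    def random_vlm(image: str, question: str) -> str:
        return random.choice(["A", "B"])

    demo_tasks = [
        NonlocalTask("comparative", "pair_001.png",
                     "Are the two shapes identical? Answer A (yes) or B (no).",
                     "B", ("A", "B")),
        NonlocalTask("saccadic", "grid_001.png",
                     "Follow the arrows from the star; do you reach A or B?",
                     "A", ("A", "B")),
        NonlocalTask("smooth_search", "maze_001.png",
                     "Trace the curve from the dot; does it end at A or B?",
                     "A", ("A", "B")),
    ]
    print(evaluate(demo_tasks, random_vlm))
```

With the random stub, each family's accuracy converges to its random baseline; a model that cannot chain evidence across distant image regions would, per the abstract's finding, score close to this baseline on the paper's task variants.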
Related papers
- Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes [8.941800684473202]
We introduce the Co-VisiON benchmark, designed to evaluate human-inspired co-visibility reasoning across over 1,000 sparse-view indoor scenarios. Our results show that while co-visibility is often approached as a low-level feature-matching task, it remains challenging for existing vision models under sparse conditions. We propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM.
arXiv Detail & Related papers (2025-06-20T07:42:26Z)
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
- Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models [51.900488744931785]
We introduce the Visual Graph Arena (VGA) to evaluate and improve AI systems' capacity for visual abstraction. Humans achieve near-perfect accuracy across tasks, while models fail completely on isomorphism detection and show only limited success on path/cycle tasks. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models.
arXiv Detail & Related papers (2025-06-06T17:06:25Z)
- Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations [61.235500325327585]
Existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE, a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through visual simulation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks.
arXiv Detail & Related papers (2025-06-05T05:09:46Z)
- CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting [59.830657530592255]
Counting Amodally for Patterns Through Unseen REgions (CAPTURe) is a testbed for evaluating vision-language models. We evaluate four strong vision-language models on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns.
arXiv Detail & Related papers (2025-04-21T23:38:43Z)
- VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues [34.95077625513563]
We introduce VLM2-Bench, a benchmark designed to assess whether vision-language models can Visually Link Matching cues. Comprehensive evaluation across twelve VLMs, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap.
arXiv Detail & Related papers (2025-02-17T17:57:50Z)
- Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models [29.078437003042357]
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. We propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning.
arXiv Detail & Related papers (2025-02-11T14:50:43Z)
- Probing Visual Language Priors in VLMs [51.016683265437536]
We introduce ViLP, a benchmark featuring deliberately out-of-distribution images. Each question in ViLP is coupled with three potential answers and three corresponding images. We propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training.
arXiv Detail & Related papers (2024-12-31T17:54:29Z)
- Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation [13.311411816150551]
We evaluate how well an MLLM understands a specific visual concept by its ability to uniquely describe two extremely similar images.
We curate 247 highly similar image pairs as part of the D3 benchmark.
For each image pair, the model is prompted to: (1) Detect a specific visual difference, and (2) Describe the target image uniquely such that it (3) Discriminates the target image from the distractor.
arXiv Detail & Related papers (2024-09-23T15:31:25Z)
- Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study generative VLMs that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)