MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
- URL: http://arxiv.org/abs/2505.20426v4
- Date: Thu, 06 Nov 2025 06:11:21 GMT
- Title: MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
- Authors: Yolo Yunlong Tang, Pinxin Liu, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu
- Abstract summary: We introduce MMPerspective, the first benchmark specifically designed to evaluate multimodal large language models' understanding of perspective. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations.
- Score: 50.33343842822694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/
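For readers unfamiliar with the geometry these tasks probe, the vanishing point tasks reduce to a simple projective construction: image lines through parallel 3D edges intersect at a common point. The following minimal sketch (not from the paper; the segment coordinates are invented for illustration) recovers that point from two line segments via cross products in homogeneous coordinates:

```python
import numpy as np

def vanishing_point(seg_a, seg_b):
    """Estimate the vanishing point implied by two image segments.

    Under a pinhole projection, image lines through parallel 3D edges
    (e.g., the two rails of a track) meet at a single vanishing point.
    Each segment is ((x1, y1), (x2, y2)) in pixel coordinates. Returns
    (x, y), or None if the image lines are parallel (point at infinity).
    """
    def line_through(p, q):
        # Homogeneous line through two image points: l = p x q
        return np.cross([*p, 1.0], [*q, 1.0])

    # The intersection of two homogeneous lines is their cross product.
    vp = np.cross(line_through(*seg_a), line_through(*seg_b))
    if abs(vp[2]) < 1e-9:  # w ~ 0: the lines are parallel in the image
        return None
    return vp[0] / vp[2], vp[1] / vp[2]

# Two invented segments along a pair of receding parallel edges:
print(vanishing_point(((100, 400), (220, 300)), ((540, 400), (420, 300))))
# -> (320.0, 216.66...): the single point where both lines converge
```

A benchmark item of the counting flavor would then ask how many such convergence points a scene contains, while the Robustness dimension checks that a model's answer stays stable when the image undergoes one of the paper's perspective-preserving transformations.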
Related papers
- Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities. In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench. We will open-source the CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z)
- The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs [44.71703930770065]
We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as face matching and text-in-vision comprehension. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations.
arXiv Detail & Related papers (2025-12-17T20:22:23Z)
- Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective [17.592210658831902]
Spatial reasoning is a core aspect of human intelligence that allows perception, inference, and planning in 3D environments. Current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align, and maintain spatial mental models across complementary viewpoints.
arXiv Detail & Related papers (2025-12-02T02:21:29Z)
- MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity [28.797461492275488]
MME-CC is a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. We identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions.
arXiv Detail & Related papers (2025-11-05T03:09:16Z)
- Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models [75.45940282834327]
We introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy, resulting in significant improvements across multiple tasks.
arXiv Detail & Related papers (2025-11-03T14:27:00Z)
- Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models [17.922450921582794]
Occlusion perception is a critical foundation for human-level spatial understanding. We introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception.
arXiv Detail & Related papers (2025-08-06T03:39:21Z)
- VLM4D: Towards Spatiotemporal Awareness in Vision Language Models [66.833085504228]
We introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal awareness of vision language models (VLMs). Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs. We identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models.
arXiv Detail & Related papers (2025-08-04T06:06:06Z)
- DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes [51.895756593200296]
Deep Inspection and Perception with RL (DIP-R1) is designed to enhance the visual perception capabilities of MLLMs. DIP-R1 guides MLLMs through detailed inspection of visual scenes via three simply designed rule-based reward models. It achieves consistent and significant improvements across various in-domain and out-of-domain scenarios.
arXiv Detail & Related papers (2025-05-29T07:16:16Z)
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [47.237216851265316]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically to evaluate multi-viewpoint spatial localization.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
- Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z)
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z)
- Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models [0.0]
We investigate the ability of Vision Language Models to perform visual perspective taking using a novel set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes, in which a single humanoid minifigure is paired with a single object. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks.
arXiv Detail & Related papers (2025-05-03T00:10:41Z)
- Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [10.792834356227118]
Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail spatial tasks despite strong object recognition capabilities.
arXiv Detail & Related papers (2025-03-21T17:51:14Z)
- DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding [61.26026947423187]
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features. Current Multimodal Large Language Models (MLLMs) struggle to integrate reasoning into visual perception. We propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T04:06:34Z)
- VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models [62.667142971664575]
We introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT). VisFactor digitalizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks. We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
- A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs). We propose three distinct evaluation paradigms, mirroring human problem-solving strategies. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
- GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs [3.2688425993442696]
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning.
We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding.
arXiv Detail & Related papers (2024-06-19T06:15:26Z)