Are MLMs Trapped in the Visual Room?
- URL: http://arxiv.org/abs/2505.23272v2
- Date: Fri, 30 May 2025 14:14:00 GMT
- Title: Are MLMs Trapped in the Visual Room?
- Authors: Yazhou Zhang, Chunwang Zou, Qimeng Liu, Lu Rong, Ben Yao, Zheng Lian, Qiuchi Li, Peng Zhang, Jing Qin,
- Abstract summary: Drawing inspiration from Searle's Chinese Room, we propose the Visual Room argument. A system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This work provides empirical grounding for the proposed Visual Room argument and offers a new evaluation paradigm for MLMs.
- Score: 17.65871959408832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the Visual Room argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. In implementation, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual contents, while the cognition component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising 924 static images and 100 dynamic videos. All sarcasm labels are annotated by the original authors and verified by independent reviewers to ensure clarity and consistency. We evaluate eight state-of-the-art (SoTA) MLMs. Our results highlight three key findings: (1) MLMs demonstrate high accuracy in visual perception; (2) even with correct perception, MLMs exhibit an average error rate of ~17.1% in sarcasm understanding, revealing a significant gap between seeing and understanding; (3) this gap stems from weaknesses in context integration, emotional reasoning, and pragmatic inference. This work provides empirical grounding for the proposed Visual Room argument and offers a new evaluation paradigm for MLMs.
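The abstract's two-tier framework reduces to a simple conditional metric: overall perception accuracy, plus the sarcasm-understanding error rate computed only over samples the model already perceived correctly (the ~17.1% figure). Below is a minimal sketch of that scoring logic, assuming hypothetical per-sample records; the field names and helper function are illustrative, not the authors' released code.

```python
# Sketch of the two-tier "seeing vs. understanding" scoring described in the
# abstract. Hypothetical data layout: one record per evaluated sample, marking
# whether the model's perception answer and sarcasm-polarity answer were correct.
from dataclasses import dataclass

@dataclass
class Sample:
    perception_correct: bool   # tier 1: surface-level visual details
    cognition_correct: bool    # tier 2: sarcasm polarity inference

def seeing_vs_understanding_gap(samples: list[Sample]) -> dict[str, float]:
    """Perception accuracy, and cognition error rate conditioned on correct
    perception: the gap the Visual Room argument highlights."""
    seen = [s for s in samples if s.perception_correct]
    perception_acc = len(seen) / len(samples)
    cognition_err_given_seen = (
        sum(not s.cognition_correct for s in seen) / len(seen) if seen else 0.0
    )
    return {
        "perception_accuracy": perception_acc,
        "cognition_error_given_correct_perception": cognition_err_given_seen,
    }

# Toy usage: 8 of 10 samples perceived correctly; 2 of those 8 misread sarcasm.
demo = [Sample(True, True)] * 6 + [Sample(True, False)] * 2 + [Sample(False, False)] * 2
print(seeing_vs_understanding_gap(demo))
# {'perception_accuracy': 0.8, 'cognition_error_given_correct_perception': 0.25}
```

Conditioning on correct perception is the key design choice: it isolates cognitive failures from perceptual ones, so the reported error cannot be explained away as the model simply failing to see the image.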
Related papers
- Visual Room 2.0: Seeing is Not Understanding for MLLMs [9.870930749379932]
We introduce Visual Room 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception to cognition.
arXiv Detail & Related papers (2025-11-17T03:34:52Z) - Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection [58.82268659497348]
We argue that the root of this failure lies in a fundamental mismatch: MLLMs are asked to reason about fakes before they can truly see them. We propose Forensic-Chat, a generalizable, explainable, and still-conversational assistant for fake image detection.
arXiv Detail & Related papers (2025-09-29T20:59:19Z) - VER-Bench: Evaluating MLLMs on Reasoning with Fine-Grained Visual Evidence [24.872901965956604]
VER-Bench is a novel framework to evaluate MLLMs' ability to identify fine-grained visual clues. Each question in VER-Bench is accompanied by structured evidence: visual clues and question-related reasoning derived from them.
arXiv Detail & Related papers (2025-08-06T19:59:42Z) - DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes [65.88899655866871]
We develop a novel framework, Deep Inspection and Perception with RL (DIP-R1), to enhance the visual perception capabilities of MLLMs. DIP-R1 guides MLLMs through detailed inspection of visual scenes via three simple rule-based reward models. Our findings highlight the substantial potential of integrating RL into MLLMs for enhancing capabilities in complex real-world perception tasks.
arXiv Detail & Related papers (2025-05-29T07:16:16Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT). GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT), consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z) - DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding [61.26026947423187]
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features. Current Multimodal Large Language Models (MLLMs) struggle to integrate reasoning into visual perception. We propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T04:06:34Z) - Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models [18.15726815994039]
We introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Our findings reveal notable discrepancies, both across LVLMs and within the same model under varied prompts. These results challenge binary labeling paradigms by highlighting sarcasm's subjectivity.
arXiv Detail & Related papers (2025-03-15T14:10:25Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding [65.28200190598082]
We propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the issue via the usage of grid-format inputs that abstractly describe physical phenomena. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o and o1, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language.
arXiv Detail & Related papers (2025-02-13T04:00:03Z) - Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly [44.31985939516153]
Multimodal Large Language Models (MLLMs) have displayed remarkable performance in multi-modal tasks. We show that MLLMs often generate incorrect answers even when they understand the visual content. We propose to enhance the model's focus on visual content during decoding by refining the text and visual prompt.
arXiv Detail & Related papers (2024-06-15T13:58:26Z) - AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception [64.25808552299905]
AesBench is an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs.
We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts.
We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives: Perception (AesP), Empathy (AesE), Assessment (AesA), and Interpretation (AesI).
arXiv Detail & Related papers (2024-01-16T10:58:07Z) - MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception [21.60103376506254]
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding.
These models also suffer from hallucinations, which limit their reliability as AI systems.
This paper aims to define and evaluate the self-awareness of MLLMs in perception.
arXiv Detail & Related papers (2024-01-15T08:19:22Z) - See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.