Seeing is Not Understanding: A Benchmark on Perception-Cognition Disparities in Large Language Models
- URL: http://arxiv.org/abs/2509.11101v3
- Date: Tue, 23 Sep 2025 02:12:08 GMT
- Title: Seeing is Not Understanding: A Benchmark on Perception-Cognition Disparities in Large Language Models
- Authors: Haokun Li, Yazhou Zhang, Jizhi Ding, Qiuchi Li, Peng Zhang
- Abstract summary: EmoBench-Reddit is a novel, hierarchical benchmark for multimodal emotion understanding. The dataset comprises 350 meticulously curated samples from the social media platform Reddit. Each data point features six multiple-choice questions and one open-ended question of increasing difficulty.
- Score: 9.870930749379932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have advanced rapidly and demonstrated exceptional capabilities across a variety of vision-language tasks. However, current evaluation benchmarks predominantly focus on objective visual question answering or captioning, inadequately assessing the models' ability to understand complex and subjective human emotions. To bridge this gap, we introduce EmoBench-Reddit, a novel, hierarchical benchmark for multimodal emotion understanding. The dataset comprises 350 meticulously curated samples from the social media platform Reddit, each containing an image, associated user-provided text, and an emotion category (sad, humor, sarcasm, happy) confirmed by user flairs. We designed a hierarchical task framework that progresses from basic perception to advanced cognition, with each data point featuring six multiple-choice questions and one open-ended question of increasing difficulty. Perception tasks evaluate the model's ability to identify basic visual elements (e.g., colors, objects), while cognition tasks require scene reasoning, intent understanding, and deep empathy integrating textual context. We ensured annotation quality through a combination of AI assistance (Claude 4) and manual verification. We conducted a comprehensive evaluation of nine leading MLLMs, including GPT-5, Gemini-2.5-pro, and GPT-4o, on EmoBench-Reddit.
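The abstract specifies what each sample contains (an image, user-provided text, a flair-confirmed emotion label, six multiple-choice questions spanning perception and cognition, and one open-ended question) but not a concrete data format. The sketch below is a minimal, hypothetical illustration of how such a sample could be represented and scored; every field name and the `model` callable are assumptions made for illustration, not the authors' released schema.

```python
# Hypothetical sketch of an EmoBench-Reddit-style sample and scoring loop.
# Field names, the answer format, and the model interface are assumptions;
# the released dataset may use a different schema entirely.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MultipleChoiceQuestion:
    level: str          # "perception" (colors, objects) or "cognition" (intent, empathy)
    question: str
    options: List[str]
    answer_index: int   # index of the correct option


@dataclass
class EmoBenchSample:
    image_path: str               # Reddit post image
    user_text: str                # associated user-provided text
    emotion: str                  # one of: "sad", "humor", "sarcasm", "happy"
    mcqs: List[MultipleChoiceQuestion] = field(default_factory=list)  # six per sample
    open_ended: str = ""          # one free-form question per sample


def evaluate_sample(sample: EmoBenchSample,
                    model: Callable[[str, str, str], str]) -> float:
    """Return multiple-choice accuracy for one sample.

    `model(image_path, user_text, prompt)` is a stand-in for any MLLM call
    that returns the chosen option letter, e.g. "B".
    """
    correct = 0
    for q in sample.mcqs:
        prompt = q.question + "\n" + "\n".join(
            f"{chr(65 + i)}. {opt}" for i, opt in enumerate(q.options)
        )
        prediction = model(sample.image_path, sample.user_text, prompt).strip()
        if prediction[:1].upper() == chr(65 + q.answer_index):
            correct += 1
    return correct / len(sample.mcqs) if sample.mcqs else 0.0
```

A full harness would additionally route the open-ended question to a judge model and aggregate accuracy per emotion category and per task level, but those details are not given in the abstract.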
Related papers
- MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models [25.072791108956682]
MultiVerse is a novel multi-turn conversation benchmark featuring 647 dialogues, each averaging four turns. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. We evaluate 18 Vision-and-Language Models (VLMs) on MultiVerse, revealing that even the strongest models achieve only a 50% success rate in complex multi-turn conversations.
arXiv Detail & Related papers (2025-10-18T21:00:12Z)
- Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models [118.44328586173556]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. Human-MME is a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Our benchmark extends single-target understanding to multi-person and multi-image mutual understanding.
arXiv Detail & Related papers (2025-09-30T12:20:57Z)
- Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models [17.922450921582794]
Occlusion perception is a critical foundation for human-level spatial understanding. We introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception.
arXiv Detail & Related papers (2025-08-06T03:39:21Z)
- HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
- Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering [11.271123465926301]
Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering. We propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions. Experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs.
arXiv Detail & Related papers (2025-06-01T03:15:29Z)
- Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding [26.36195886824082]
Emotion-Qwen is a unified multimodal framework designed to simultaneously enable robust emotion understanding and preserve general reasoning capabilities. We develop the Video Emotion Reasoning dataset, a large-scale bilingual resource containing over 40K video clips annotated with detailed context-aware emotional descriptions.
arXiv Detail & Related papers (2025-05-10T16:15:26Z)
- Grounding Task Assistance with Multimodal Cues from a Single Demonstration [17.975173937253494]
We introduce MICA (Multimodal Interactive Contextualized Assistance), a framework that improves conversational agents for task assistance by integrating eye gaze and speech cues. Evaluations on questions derived from real-time chat-assisted task replication show that multimodal cues significantly improve response quality over frame-based retrieval.
arXiv Detail & Related papers (2025-05-02T20:43:11Z)
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
- Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models [35.24458725308099]
We propose Emotion Interpretation (EI), focusing on causal factors that drive emotional responses. Unlike traditional emotion recognition, EI tasks require reasoning about triggers instead of mere labeling. We present EIBench, a large-scale benchmark encompassing 1,615 basic EI samples and 50 complex EI samples.
arXiv Detail & Related papers (2025-04-10T07:33:49Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models [30.986157664865534]
We introduce, for the first time, a comprehensive, multi-level Chinese-based benchmark for evaluating the understanding of implicit meanings in images. This benchmark is systematically categorized into four subtasks: surface-level content understanding, symbolic meaning interpretation, background knowledge comprehension, and implicit meaning comprehension. Using this benchmark, we evaluate 15 open-source large vision language models (LVLMs) and GPT-4o, revealing that even the best-performing model lags behind human performance by nearly 14% in understanding implicit meaning.
arXiv Detail & Related papers (2025-02-19T13:42:37Z)
- EmoLLM: Multimodal Emotional Understanding Meets Large Language Models [61.179731667080326]
Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks.
But their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored.
EmoLLM is a novel model for multimodal emotional understanding, incorporating two core techniques.
arXiv Detail & Related papers (2024-06-24T08:33:02Z)
- ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z)
- TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z)
- Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark [80.79082788458602]
We provide a new multi-task benchmark for evaluating text-to-image models.
We compare the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models.
Twenty computer science AI graduate students evaluated the two models on three tasks, at three difficulty levels, across ten prompts each.
arXiv Detail & Related papers (2022-11-22T09:27:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.