VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
- URL: http://arxiv.org/abs/2511.20272v1
- Date: Tue, 25 Nov 2025 12:58:32 GMT
- Title: VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
- Authors: Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng, Limin Wang, Yu Qiao, Yi Wang,
- Abstract summary: Visual knowledge forms a bridge between perception and reasoning.<n> Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance.<n>We introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs.
- Score: 35.79620808899466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in the world-centric. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.
Related papers
- Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities.<n>In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench.<n>We will open-source the CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z) - Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency [29.28977802424541]
We introduce VCBENCH, a benchmark for multimodal mathematical reasoning with explicit visual dependencies.<n> VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning.<n>We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy.
arXiv Detail & Related papers (2025-04-24T06:16:38Z) - Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark settings to simulate human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z) - What is the Visual Cognition Gap between Humans and Multimodal LLMs? [63.81347276258992]
We evaluate the visual cognition capability of Multimodal Large Language Models (MLLMs) and compare their performance with human visual cognition studies.<n>Our comparative experiments with different baselines reveal a gap between MLLMs and human intelligence.<n>We believe that the public release of MaRs-VQA and the Qwen2-VCog baseline model will drive progress toward the next generation of MLLMs with human-like visual cognition abilities.
arXiv Detail & Related papers (2024-06-14T22:02:21Z) - Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment [31.688373463643373]
Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, helping improve the accuracy of answers to knowledge-based visual questions.
We present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage.
We conduct extensive experiments on knowledge-based VQA benchmarks and experimental results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (average gain by 5.0%).
arXiv Detail & Related papers (2024-02-21T06:34:46Z) - MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception [21.60103376506254]
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding.
These models also suffer from hallucinations, which limit their reliability as AI systems.
This paper aims to define and evaluate the self-awareness of MLLMs in perception.
arXiv Detail & Related papers (2024-01-15T08:19:22Z) - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify ''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z) - VCoder: Versatile Vision Encoders for Multimodal Large Language Models [46.95488342139727]
Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks.
However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail.
We propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs.
arXiv Detail & Related papers (2023-12-21T18:49:47Z) - Large Knowledge Model: Perspectives and Challenges [37.42721596964844]
emphLarge Language Models (LLMs) epitomize the pre-training of extensive, sequence-based world knowledge into neural networks.
This article explores large models through the lens of "knowledge"
Considering the intricate nature of human knowledge, we advocate for the creation of emphLarge Knowledge Models (LKM)
arXiv Detail & Related papers (2023-12-05T12:07:30Z) - A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering [53.70661720114377]
multimodal large models (MLMs) have significantly advanced the field of visual understanding, offering remarkable capabilities in realm of visual question answering (VQA)
Yet, the true challenge lies in the domain of knowledge-intensive VQA tasks, which necessitate deep comprehension of the visual information in conjunction with a vast repository of learned knowledge.
To uncover such capabilities, we provide an in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which assesses how well models can understand visual cues and connect to general knowledge; 2) Fine-grained World Knowledge, which tests the model's skill in reasoning out specific knowledge from images, showcasing
arXiv Detail & Related papers (2023-11-13T18:22:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.