VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
- URL: http://arxiv.org/abs/2511.18121v1
- Date: Sat, 22 Nov 2025 17:01:03 GMT
- Title: VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
- Authors: Ming Zhong, Yuanlei Wang, Liuzhou Zhang, Arctanx An, Renrui Zhang, Hao Liang, Ming Lu, Ying Shen, Wentao Zhang
- Abstract summary: We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics.
- Score: 49.55286536996476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans, who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning, guided by Monte Carlo Tree Search (MCTS), and show that strengthening low-level capabilities yields measurable gains at higher levels. Notably, the tuned model not only improves on HVCU-Bench but also yields gains on general benchmarks (average +2.53%), with a substantial improvement on MMStar (+7.26%), demonstrating the value of the hierarchical thinking pattern for enhancing MLLM capabilities. The project page is at https://vcu-bridge.github.io.
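The abstract names two mechanisms without implementation detail: a three-level reasoning hierarchy (foundational perception, semantic bridging, abstract connotation) and an MCTS-guided pipeline that generates instruction-tuning data. As a rough illustration only, the following is a minimal, self-contained sketch of how such a level-wise search could be structured; the level names, `propose_steps`, and `score_trace` are hypothetical stand-ins, not the authors' code (a real pipeline would sample candidate steps from an MLLM and score complete traces with a verifier):

```python
# Hypothetical sketch: MCTS over level-wise reasoning traces
# (perception -> bridging -> connotation), not the paper's released code.
import math
import random
from dataclasses import dataclass, field

LEVELS = ["perception", "bridging", "connotation"]  # assumed level names

@dataclass
class Node:
    trace: tuple               # reasoning steps committed so far, one per level
    visits: int = 0
    value: float = 0.0
    children: list = field(default_factory=list)

def propose_steps(image_id, level, k=3):
    """Stand-in for sampling k candidate reasoning steps from an MLLM."""
    return [f"{level} step {i} for {image_id}" for i in range(k)]

def score_trace(trace):
    """Stand-in reward; a real verifier would check each level's step
    against ground-truth evidence for the image."""
    return random.random()

def ucb(child, parent_visits, c=1.4):
    """Upper-confidence bound used to pick which child to descend into."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent_visits) / child.visits)

def mcts_trace(image_id, iters=50):
    """Search for a full perception -> bridging -> connotation trace."""
    root = Node(trace=())
    for _ in range(iters):
        node, path = root, [root]
        # selection / expansion: commit one step per level on the way down
        while len(node.trace) < len(LEVELS):
            if not node.children:
                level = LEVELS[len(node.trace)]
                node.children = [Node(trace=node.trace + (s,))
                                 for s in propose_steps(image_id, level)]
            node = max(node.children, key=lambda ch: ucb(ch, node.visits + 1))
            path.append(node)
        # evaluation + backpropagation
        reward = score_trace(node.trace)
        for n in path:
            n.visits += 1
            n.value += reward
    # extract the most-visited complete trace as a training example
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.trace

print(mcts_trace("img_001"))
```

The property this sketch tries to mirror is the evidence-to-inference ordering from the abstract: a connotation-level step is only ever expanded on top of committed perception- and bridging-level steps, so weak low-level reasoning directly caps the reward reachable at higher levels.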
Related papers
- Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. We introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings. Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z)
- Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models [42.79282247484499]
Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. Recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge. We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks.
arXiv Detail & Related papers (2026-02-19T22:07:29Z)
- Prompt the Unseen: Evaluating Visual-Language Alignment Beyond Supervision [22.712690974750007]
Vision-Language Models (VLMs) combine a vision encoder and a large language model (LLM) through alignment training. Despite its importance, the projection layer's ability to generalize to unseen visual concepts has not been systematically evaluated. This study introduces a new evaluation framework for alignment generalization.
arXiv Detail & Related papers (2025-08-31T05:00:51Z) - Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger [51.01841635655944]
Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks. Existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. We propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and applying a Tree Search re-ranking method.
arXiv Detail & Related papers (2025-06-09T14:00:57Z) - Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation [19.46864730994867]
We introduce COVER (COunterfactual VidEo Reasoning), a multidimensional multimodal benchmark. It decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis.
arXiv Detail & Related papers (2025-03-12T03:25:51Z) - Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units. DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
arXiv Detail & Related papers (2025-03-10T22:53:56Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multi-layer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z) - Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning [125.79428219851289]
Inst-IT is a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm.
arXiv Detail & Related papers (2024-12-04T18:58:10Z)
- DEAL: Disentangle and Localize Concept-level Explanations for VLMs [10.397502254316645]
Large pre-trained Vision-Language Models might not be able to identify fine-grained concepts.
We propose to DisEntAngle and Localize (DEAL) concept-level explanations for VLMs without human annotations.
Our empirical results demonstrate that the proposed method significantly improves the concept-level explanations of the model in terms of disentanglability and localizability.
arXiv Detail & Related papers (2024-07-19T15:39:19Z)
- NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision [64.83085920775316]
We introduce NPHardEval4V, a multimodal benchmark suite grounded in four classical NP-hard problems. Each task is presented through a combination of structured visual layouts and textual prompts, designed to assess the ability of LVLMs to perform reasoning under visual-linguistic constraints. Our results show that while these models perform reasonably well on perception-based inputs, they struggle with global optimization, abstraction, and constraint satisfaction.
arXiv Detail & Related papers (2024-03-04T07:10:31Z)