Can Large Multimodal Models Uncover Deep Semantics Behind Images?
- URL: http://arxiv.org/abs/2402.11281v3
- Date: Thu, 20 Jun 2024 13:12:31 GMT
- Title: Can Large Multimodal Models Uncover Deep Semantics Behind Images?
- Authors: Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, Zhifang Sui
- Abstract summary: We introduce DEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models' capacity to understand visual deep semantics.
We evaluate 9 open-source LMMs and GPT-4V(ision).
For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description.
- Score: 29.399943397718815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the deep semantics of images is essential in an era dominated by social media. However, current research focuses primarily on superficial descriptions of images, revealing a notable deficiency in the systematic investigation of their inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models' (LMMs) capacity to understand visual deep semantics. DEEPEVAL includes a human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Using DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and humans. For example, GPT-4V trails humans by 30% in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis reveals that LMM performance on DEEPEVAL varies with the specific facet of deep semantics explored, indicating fundamental challenges that remain in developing LMMs.
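To make the evaluation protocol concrete, the sketch below shows how a DEEPEVAL-style multiple-choice evaluation could be scored per subtask. This is a minimal, hypothetical sketch: the dataset field names (`image`, `options`, `answer_idx`) and the `model` callable are placeholders for illustration, not the benchmark's actual interface.

```python
# Hypothetical sketch of a DEEPEVAL-style multiple-choice evaluation loop.
# The dataset schema and the `model` callable are assumptions made for
# illustration; they are not taken from the DEEPEVAL release.
from typing import Callable, Dict, List

SUBTASKS = [
    "fine_grained_description_selection",
    "in_depth_title_matching",
    "deep_semantics_understanding",
]

def evaluate(model: Callable[[str, List[str], str], int],
             dataset: Dict[str, List[dict]]) -> Dict[str, float]:
    """Return per-subtask accuracy for a model that, given an image path,
    a list of candidate answers, and the subtask name, returns the index
    of its chosen candidate."""
    scores: Dict[str, float] = {}
    for subtask in SUBTASKS:
        examples = dataset.get(subtask, [])
        if not examples:
            continue
        correct = 0
        for ex in examples:
            # Each example is assumed to hold an image path, candidate
            # options, and the index of the gold (human-annotated) answer.
            pred = model(ex["image"], ex["options"], subtask)
            correct += int(pred == ex["answer_idx"])
        scores[subtask] = correct / len(examples)
    return scores
```

Per-subtask accuracy computed this way can then be compared against the human baseline, for example the roughly 30% gap reported for GPT-4V on deep semantics understanding.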
Related papers
- Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports [10.925743866700037]
We study the inferential abilities of large vision-language models on texts related to amodal completion.
Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed.
Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.
arXiv Detail & Related papers (2025-07-08T09:06:47Z) - HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding [79.06209664703258]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos.
Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios.
We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z) - DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning [0.0]
We propose DeepSPG, a new deep semantic prior-guided framework based on Retinex image decomposition for low-light image enhancement (LLIE).
Our proposed DeepSPG demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets.
arXiv Detail & Related papers (2025-04-27T06:56:07Z) - VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories.
These questions assess the visual reasoning capabilities of MLLMs from multiple perspectives.
Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce FG-BMK, a comprehensive fine-grained evaluation benchmark comprising 3.49 million questions and 3.32 million images.
Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives.
We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z) - Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs [57.66267515456075]
Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through textual representations.
We propose a zero-shot fully automatic framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams.
arXiv Detail & Related papers (2025-03-14T18:27:02Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences? [32.61269125015993]
StripCipher is a benchmark designed to evaluate the capabilities of Large Multimodal Models (LMMs) to comprehend and reason over sequential images.
StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering.
Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities.
arXiv Detail & Related papers (2025-02-19T18:04:44Z) - HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks [25.959032350818795]
We present HumanEval-V, a benchmark of human-annotated coding tasks.
Each task features carefully crafted diagrams paired with function signatures and test cases.
We find that even top-performing models achieve modest success rates.
arXiv Detail & Related papers (2024-10-16T09:04:57Z) - F-LMM: Grounding Frozen Large Multimodal Models [53.8059045627934]
We present F-LMM, which grounds frozen off-the-shelf LMMs in human-AI conversations.
Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits.
Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data.
arXiv Detail & Related papers (2024-06-09T15:14:26Z) - VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations [13.608653575298183]
This paper introduces the VISLA benchmark, designed to evaluate the semantic and lexical understanding of language models.
An evaluation involving 34 vision-language models (VLMs) and 20 unimodal language models (ULMs) reveals surprising difficulties in distinguishing between lexical and semantic variations.
arXiv Detail & Related papers (2024-04-25T07:08:00Z) - Assessment of Multimodal Large Language Models in Alignment with Human Values [43.023052912326314]
We introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations.
The Ch3Ef dataset contains 1002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh (helpful, honest, harmless) principle.
arXiv Detail & Related papers (2024-03-26T16:10:21Z) - Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.
Lumen first promotes fine-grained vision-language concept alignment.
Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z) - Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, producing discrepancies when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z) - Understanding Before Recommendation: Semantic Aspect-Aware Review Exploitation via Large Language Models [53.337728969143086]
Recommendation systems harness user-item interactions like clicks and reviews to learn their representations.
Previous studies improve recommendation accuracy and interpretability by modeling user preferences across various aspects and intents.
We introduce a chain-based prompting approach to uncover semantic aspect-aware interactions.
arXiv Detail & Related papers (2023-12-26T15:44:09Z) - Unsupervised discovery of Interpretable Visual Concepts [0.0]
We propose two methods to explain a model's decision, enhancing global interpretability.
One method is inspired by Occlusion and Sensitivity analysis (incorporating causality).
The other method uses a novel metric, called Class-aware Order Correlation (CaOC), to globally evaluate the most important image regions.
arXiv Detail & Related papers (2023-08-31T07:53:02Z) - Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par with or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z) - Boosting Video-Text Retrieval with Explicit High-Level Semantics [115.66219386097295]
We propose a novel visual-linguistic aligning model named HiSE for VTR.
It improves the cross-modal representation by incorporating explicit high-level semantics.
Our method achieves superior performance over state-of-the-art methods on three benchmark datasets.
arXiv Detail & Related papers (2022-08-08T15:39:54Z) - SAFENet: Self-Supervised Monocular Depth Estimation with Semantic-Aware Feature Extraction [27.750031877854717]
We propose SAFENet, which is designed to leverage semantic information to overcome the limitations of the photometric loss.
Our key idea is to exploit semantic-aware depth features that integrate semantic and geometric knowledge.
Experiments on the KITTI dataset demonstrate that our method competes with or even outperforms state-of-the-art methods.
arXiv Detail & Related papers (2020-10-06T17:22:25Z)