Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View
- URL: http://arxiv.org/abs/2405.17201v1
- Date: Mon, 27 May 2024 14:22:03 GMT
- Title: Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View
- Authors: Jin Wang, Shichao Dong, Yapeng Zhu, Kelu Yao, Weidong Zhao, Chao Li, Ping Luo
- Abstract summary: Vision Language Models (VLMs) surprisingly lack sufficient knowledge with respect to compositional reasoning.
We propose evaluation methods from a novel game-theoretic view to assess the vulnerability of VLMs on different aspects of compositional understanding.
- Score: 26.52297849056656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compositional reasoning capabilities are usually considered as fundamental skills to characterize human perception. Recent studies show that current Vision Language Models (VLMs) surprisingly lack sufficient knowledge with respect to such capabilities. To this end, we propose to thoroughly diagnose the composition representations encoded by VLMs, systematically revealing the potential cause for this weakness. Specifically, we propose evaluation methods from a novel game-theoretic view to assess the vulnerability of VLMs on different aspects of compositional understanding, e.g., relations and attributes. Extensive experimental results demonstrate and validate several insights to understand the incapabilities of VLMs on compositional reasoning, which provide useful and reliable guidance for future studies. The deliverables will be updated at https://vlms-compositionality-gametheory.github.io/.
Related papers
- MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models [85.10375181040436]
We propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating Vision-Language Models.
We find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons.
arXiv Detail & Related papers (2024-10-13T05:35:09Z)
- Do Vision-Language Models Really Understand Visual Language? [43.893398898373995]
Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image.
Recent studies suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams.
This paper develops a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs.
arXiv Detail & Related papers (2024-09-30T19:45:11Z)
- Beyond the Hype: A dispassionate look at vision-language models in medical scenario [3.4299097748670255]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks.
Their performance and reliability in specialized domains such as medicine remain insufficiently assessed.
We introduce RadVUQA, a novel benchmark to comprehensively evaluate existing LVLMs.
arXiv Detail & Related papers (2024-08-16T12:32:44Z)
- Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images [19.923665989164387]
We propose a novel Multimodal Causal Reasoning benchmark, namely MuCR, to challenge Large Language Models.
Specifically, we introduce a prompt-driven image synthesis approach to create siamese images with embedded semantic causality and visual cues.
Our extensive experiments reveal that the current state-of-the-art VLLMs are not as skilled at multimodal causal reasoning as we might have hoped.
arXiv Detail & Related papers (2024-08-15T12:04:32Z)
- In-Context Learning Improves Compositional Understanding of Vision-Language Models [2.762909189433944]
Compositional image understanding remains a rather difficult task due to the object bias present in training data.
We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses.
Our proposed approach outperforms baseline models across multiple compositional understanding datasets.
arXiv Detail & Related papers (2024-07-22T09:03:29Z)
- Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition [61.956088652094515]
Vision and language models (VLMs) have showcased remarkable zero-shot recognition abilities.
But they face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text alignment.
This paper explores the intricate relationship between compositionality and recognition.
arXiv Detail & Related papers (2024-06-13T17:58:39Z)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing a discrepancy when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
- A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering [53.70661720114377]
Multimodal large models (MLMs) have significantly advanced the field of visual understanding, offering remarkable capabilities in the realm of visual question answering (VQA).
Yet, the true challenge lies in the domain of knowledge-intensive VQA tasks, which necessitate deep comprehension of the visual information in conjunction with a vast repository of learned knowledge.
To uncover such capabilities, we provide an in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which assesses how well models can understand visual cues and connect to general knowledge; 2) Fine-grained World Knowledge, which tests the model's skill in reasoning out specific knowledge from images, showcasing
arXiv Detail & Related papers (2023-11-13T18:22:32Z)
- Large Language Models are Visual Reasoning Coordinators [144.67558375045755]
We propose a novel paradigm that coordinates multiple vision-language models for visual reasoning.
We show that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering.
We also show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings.
arXiv Detail & Related papers (2023-10-23T17:59:31Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.