If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
- URL: http://arxiv.org/abs/2403.16442v1
- Date: Mon, 25 Mar 2024 06:05:50 GMT
- Title: If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
- Authors: Reza Esfandiarpoor, Cristina Menghini, Stephen H. Bach
- Abstract summary: Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape.
We propose Extract and Explore (EX2), a novel approach to characterize important textual features.
We show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
- Score: 9.190831897944957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach to characterize important textual features for VLMs. EX2 uses reinforcement learning to align a large language model with VLM preferences and generates descriptions that incorporate the important features for the VLM. Then, we inspect the descriptions to identify the features that contribute to VLM representations. We find that spurious descriptions have a major role in VLM representations despite providing no helpful information, e.g., Click to enlarge photo of CONCEPT. More importantly, among informative descriptions, VLMs rely significantly on non-visual attributes like habitat to represent visual concepts. Also, our analysis reveals that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
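At the core of this kind of analysis is CLIP-style scoring: a description matters to the VLM if its text embedding lies close to the concept's image embedding. The ranking step can be sketched with toy vectors standing in for real encoder outputs; the embeddings and description strings below are illustrative only, not taken from the paper, and a real probe would use an actual CLIP-style image and text encoder.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for VLM embeddings (illustrative values only).
image_embedding = np.array([0.9, 0.1, 0.4])

candidate_descriptions = {
    "a bird with a bright red crest": np.array([0.8, 0.2, 0.5]),
    "a species found in wetland habitats": np.array([0.7, 0.3, 0.6]),
    "click to enlarge photo of a bird": np.array([0.85, 0.15, 0.45]),
}

# Rank descriptions by similarity to the image embedding; inspecting the
# top-ranked strings corresponds to the "Explore" step of an EX2-style
# analysis: do visual, non-visual, or spurious descriptions score highest?
ranked = sorted(
    candidate_descriptions.items(),
    key=lambda item: cosine_similarity(image_embedding, item[1]),
    reverse=True,
)
for text, _ in ranked:
    print(text)
```

With these toy values the spurious "click to enlarge" string happens to rank first, which mirrors the paper's observation that uninformative descriptions can dominate VLM representations.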
Related papers
- BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM).
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
arXiv Detail & Related papers (2024-07-18T12:11:12Z)
- IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model [52.697180472760635]
This paper explores the potential of character identities memory and recognition across multiple visual scenarios.
We propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM.
Our research introduces a novel benchmark MM-ID, to examine LVLMs on instance IDs memory and recognition across four dimensions.
arXiv Detail & Related papers (2024-07-10T12:11:59Z)
- An Introduction to Vision-Language Modeling [128.6223984157515]
The vision-language model (VLM) applications will significantly impact our relationship with technology.
We introduce what VLMs are, how they work, and how to train them.
Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
arXiv Detail & Related papers (2024-05-27T15:01:23Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- MyVLM: Personalizing VLMs for User-Specific Queries [78.33252556805931]
We take a first step toward the personalization of vision-language models, enabling them to learn and reason over user-provided concepts.
To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model.
Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM.
This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response.
arXiv Detail & Related papers (2024-03-21T17:51:01Z)
- CoLLaVO: Crayon Large Language and Vision mOdel [42.182009352159]
It remains unexplored whether current Vision Language Models genuinely possess quality object-level image understanding capabilities.
Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks.
To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO).
We present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning.
arXiv Detail & Related papers (2024-02-17T11:03:02Z)
- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
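The similarity-based classification described above can be sketched as follows. Frame features and class label embeddings here are toy values; a real pipeline would obtain them from a CLIP-style image encoder and text encoder, and the class names are illustrative.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy stand-ins for per-frame visual features (illustrative values only).
frame_features = np.array([
    [0.90, 0.10, 0.20],  # frame 1
    [0.80, 0.20, 0.30],  # frame 2
    [0.85, 0.15, 0.25],  # frame 3
])

# Toy stand-ins for text embeddings of the class labels.
class_text_features = np.array([
    [0.9, 0.1, 0.2],  # "playing guitar"
    [0.1, 0.9, 0.3],  # "swimming"
])
class_names = ["playing guitar", "swimming"]

# Pool frame features into a single video representation, then score each
# class label by cosine similarity (dot product of unit vectors).
video_feature = l2_normalize(frame_features.mean(axis=0))
scores = l2_normalize(class_text_features) @ video_feature
predicted = class_names[int(np.argmax(scores))]
print(predicted)
```

Mean pooling over frames is one simple choice; ensembling approaches like the one described here instead enrich the text side with generated descriptions before computing the same kind of similarity score.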
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
- Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs [79.64891686479213]
We show that it is possible to improve vision and language models (VLMs) when learning from scene graphs (SGs).
For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions.
Our method improves the performance of several popular VLMs on multiple datasets with only a mild degradation in ZS capabilities.
arXiv Detail & Related papers (2023-05-10T17:52:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.