If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
- URL: http://arxiv.org/abs/2403.16442v1
- Date: Mon, 25 Mar 2024 06:05:50 GMT
- Title: If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
- Authors: Reza Esfandiarpoor, Cristina Menghini, Stephen H. Bach
- Abstract summary: Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape.
We propose Extract and Explore (EX2), a novel approach to characterize important textual features.
We show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
- Score: 9.190831897944957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach to characterize important textual features for VLMs. EX2 uses reinforcement learning to align a large language model with VLM preferences and generates descriptions that incorporate the important features for the VLM. Then, we inspect the descriptions to identify the features that contribute to VLM representations. We find that spurious descriptions have a major role in VLM representations despite providing no helpful information, e.g., Click to enlarge photo of CONCEPT. More importantly, among informative descriptions, VLMs rely significantly on non-visual attributes like habitat to represent visual concepts. Also, our analysis reveals that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
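At the core of this kind of analysis is CLIP-style scoring: a description matters to the VLM if its text embedding lies close to the concept's image embedding. The ranking step can be sketched with toy vectors standing in for real encoder outputs; the embeddings and description strings below are illustrative only, not taken from the paper, and a real probe would use an actual CLIP-style image and text encoder.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for VLM embeddings (illustrative values only).
image_embedding = np.array([0.9, 0.1, 0.4])

candidate_descriptions = {
    "a bird with a bright red crest": np.array([0.8, 0.2, 0.5]),
    "a species found in wetland habitats": np.array([0.7, 0.3, 0.6]),
    "click to enlarge photo of a bird": np.array([0.85, 0.15, 0.45]),
}

# Rank descriptions by similarity to the image embedding; inspecting the
# top-ranked strings corresponds to the "Explore" step of an EX2-style
# analysis: do visual, non-visual, or spurious descriptions score highest?
ranked = sorted(
    candidate_descriptions.items(),
    key=lambda item: cosine_similarity(image_embedding, item[1]),
    reverse=True,
)
for text, _ in ranked:
    print(text)
```

With these toy values the spurious "click to enlarge" string happens to rank first, which mirrors the paper's observation that uninformative descriptions can dominate VLM representations.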
Related papers
- BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM).
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
arXiv Detail & Related papers (2024-07-18T12:11:12Z)
- IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model [52.697180472760635]
This paper explores the potential of character identities memory and recognition across multiple visual scenarios.
We propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM.
Our research introduces a novel benchmark MM-ID, to examine LVLMs on instance IDs memory and recognition across four dimensions.
arXiv Detail & Related papers (2024-07-10T12:11:59Z)
- An Introduction to Vision-Language Modeling [128.6223984157515]
The vision-language model (VLM) applications will significantly impact our relationship with technology.
We introduce what VLMs are, how they work, and how to train them.
Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
arXiv Detail & Related papers (2024-05-27T15:01:23Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- MyVLM: Personalizing VLMs for User-Specific Queries [78.33252556805931]
We take a first step toward the personalization of vision-language models, enabling them to learn and reason over user-provided concepts.
To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model.
Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM.
This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response.
arXiv Detail & Related papers (2024-03-21T17:51:01Z)
- CoLLaVO: Crayon Large Language and Vision mOdel [42.182009352159]
It remains unexplored whether current Vision Language Models genuinely possess quality object-level image understanding capabilities.
Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks.
To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO).
We present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning.
arXiv Detail & Related papers (2024-02-17T11:03:02Z)
- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
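The similarity-based classification described above can be sketched as follows. Frame features and class label embeddings here are toy values; a real pipeline would obtain them from a CLIP-style image encoder and text encoder, and the class names are illustrative.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy stand-ins for per-frame visual features (illustrative values only).
frame_features = np.array([
    [0.90, 0.10, 0.20],  # frame 1
    [0.80, 0.20, 0.30],  # frame 2
    [0.85, 0.15, 0.25],  # frame 3
])

# Toy stand-ins for text embeddings of the class labels.
class_text_features = np.array([
    [0.9, 0.1, 0.2],  # "playing guitar"
    [0.1, 0.9, 0.3],  # "swimming"
])
class_names = ["playing guitar", "swimming"]

# Pool frame features into a single video representation, then score each
# class label by cosine similarity (dot product of unit vectors).
video_feature = l2_normalize(frame_features.mean(axis=0))
scores = l2_normalize(class_text_features) @ video_feature
predicted = class_names[int(np.argmax(scores))]
print(predicted)
```

Mean pooling over frames is one simple choice; ensembling approaches like the one described here instead enrich the text side with generated descriptions before computing the same kind of similarity score.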
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
- Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs [79.64891686479213]
We show that it is possible to improve vision and language models (VLMs) when learning from scene graphs (SGs).
For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions.
Our method improves the performance of several popular VLMs on multiple datasets with only a mild degradation in ZS capabilities.
arXiv Detail & Related papers (2023-05-10T17:52:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.