Concept-based Analysis of Neural Networks via Vision-Language Models
- URL: http://arxiv.org/abs/2403.19837v3
- Date: Wed, 10 Apr 2024 23:47:34 GMT
- Title: Concept-based Analysis of Neural Networks via Vision-Language Models
- Authors: Ravi Mangal, Nina Narodytska, Divya Gopinath, Boyue Caroline Hu, Anirban Roy, Susmit Jha, Corina Pasareanu,
- Abstract summary: We propose to leverage emerging multimodal, vision-language, foundation models (VLMs) as a lens through which we can reason about vision models.
We describe a logical specification language $textttCon_textttspec$ designed to facilitate writing specifications in terms of these concepts.
We build a map between the internal representations of a given vision model and a VLM, leading to an efficient verification procedure of natural-language properties for vision models.
- Score: 17.406352568156542
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The analysis of vision-based deep neural networks (DNNs) is highly desirable but it is very challenging due to the difficulty of expressing formal specifications for vision tasks and the lack of efficient verification procedures. In this paper, we propose to leverage emerging multimodal, vision-language, foundation models (VLMs) as a lens through which we can reason about vision models. VLMs have been trained on a large body of images accompanied by their textual description, and are thus implicitly aware of high-level, human-understandable concepts describing the images. We describe a logical specification language $\texttt{Con}_{\texttt{spec}}$ designed to facilitate writing specifications in terms of these concepts. To define and formally check $\texttt{Con}_{\texttt{spec}}$ specifications, we build a map between the internal representations of a given vision model and a VLM, leading to an efficient verification procedure of natural-language properties for vision models. We demonstrate our techniques on a ResNet-based classifier trained on the RIVAL-10 dataset using CLIP as the multimodal model.
Related papers
- Language Model as Visual Explainer [72.88137795439407]
We present a systematic approach for interpreting vision models using a tree-structured linguistic explanation.
Our method provides human-understandable explanations in the form of attribute-laden trees.
To access the effectiveness of our approach, we introduce new benchmarks and conduct rigorous evaluations.
arXiv Detail & Related papers (2024-12-08T20:46:23Z) - MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding [6.538592344967826]
We introduce MUSE-VL, a Unified Vision-Language Model Semantic through discrete -language for multimodal understanding and generation.
The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.
arXiv Detail & Related papers (2024-11-26T03:33:52Z) - Towards Interpreting Visual Information Processing in Vision-Language Models [24.51408101801313]
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images.
We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM.
arXiv Detail & Related papers (2024-10-09T17:55:02Z) - VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models [2.0718016474717196]
integrated Vision and Language Models (VLMs) are frequently regarded as black boxes within the machine learning research community.
We present an image-text aligned human visual attention dataset that maps specific associations between image regions and corresponding text segments.
We then compare the internal heatmaps generated by VL models with this dataset, allowing us to analyze and better understand the model's decision-making process.
arXiv Detail & Related papers (2024-10-06T20:11:53Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z) - GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception
Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models are utilized to help models like CNN and ViT learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.