Probing Conceptual Understanding of Large Visual-Language Models
- URL: http://arxiv.org/abs/2304.03659v3
- Date: Fri, 26 Apr 2024 16:23:31 GMT
- Title: Probing Conceptual Understanding of Large Visual-Language Models
- Authors: Madeline Schiappa, Raiyaan Abdullah, Shehreen Azad, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat
- Abstract summary: It is not well studied whether large visual (V+L) models have a conceptual grasp of the visual content.
We propose novel benchmarking datasets for probing three different aspects of content understanding.
Our probes are grounded in cognitive science and help determine if a V+L model can, for example, determine if snow garnished with a man is implausible.
- Score: 5.3937680430575226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, large visual-language (V+L) models have achieved great success in various downstream tasks. However, it is not well studied whether these models have a conceptual grasp of the visual content. In this work we focus on conceptual understanding of these large V+L models. To facilitate this study, we propose novel benchmarking datasets for probing three different aspects of content understanding: 1) \textit{relations}, 2) \textit{composition}, and 3) \textit{context}. Our probes are grounded in cognitive science and help determine if a V+L model can, for example, determine if snow garnished with a man is implausible, or if it can identify beach furniture by knowing it is located on a beach. We experimented with many recent state-of-the-art V+L models and observed that these models mostly \textit{fail to demonstrate} a conceptual understanding. This study reveals several interesting insights, such as that \textit{cross-attention} helps in learning conceptual understanding, and that CNNs are better with \textit{texture and patterns}, while Transformers are better at \textit{color and shape}. We further utilize some of these insights and investigate a \textit{simple finetuning technique} that rewards the three conceptual understanding measures, with promising initial results. The proposed benchmarks will drive the community to delve deeper into conceptual understanding and foster advancements in the capabilities of large V+L models. The code and dataset are available at: \url{https://tinyurl.com/vlm-robustness}
Related papers
- Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models [51.900488744931785]
We introduce the Visual Graph Arena (VGA) to evaluate and improve AI systems' capacity for visual abstraction. Humans achieve near-perfect accuracy across tasks, while models fail entirely on isomorphism detection and show only limited success in path/cycle tasks. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models.
arXiv Detail & Related papers (2025-06-06T17:06:25Z) - Fill in the blanks: Rethinking Interpretability in vision [0.0]
We re-think vision-model explainability from a novel perspective, to probe the general input structure that a model has learnt during its training.
Experiments on standard vision datasets and pre-trained models reveal consistent patterns, which could be integrated as an additional model-agnostic explainability tool.
arXiv Detail & Related papers (2024-11-15T15:31:06Z) - VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin).
VisMin requires models to predict the correct image-caption match given two images and two captions.
We generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks.
arXiv Detail & Related papers (2024-07-23T18:10:43Z) - Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers [11.155818952879146]
Recent work has shown that object-centric representations can greatly help improve the accuracy of learning dynamics.
Can learning disentangled representation further improve the accuracy of visual dynamics prediction in object-centric models?
We try to learn such disentangled representations for the case of static images, without making any specific assumptions about the kind of attributes that an object might have.
arXiv Detail & Related papers (2024-07-03T15:43:54Z) - Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features.
arXiv Detail & Related papers (2024-04-12T17:58:04Z) - Explaining Explainability: Understanding Concept Activation Vectors [35.37586279472797]
Recent interpretability methods propose using concept-based explanations to translate internal representations of deep learning models into a language that humans are familiar with: concepts.
This requires understanding which concepts are present in the representation space of a neural network.
In this work, we investigate three properties of Concept Activation Vectors (CAVs), which are learnt using a probe dataset of concept exemplars.
We introduce tools designed to detect the presence of these properties, provide insight into how they affect the derived explanations, and provide recommendations to minimise their impact.
arXiv Detail & Related papers (2024-04-04T17:46:20Z) - Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models [80.32412260877628]
We study how to learn human-interpretable concepts from data. Weaving together ideas from both fields, we show that concepts can be provably recovered from diverse data.
arXiv Detail & Related papers (2024-02-14T15:23:59Z) - Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions [6.231370972617915]
Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts.
Existing vision-language alignment models, e.g., CLIP, struggle with both aspects and thus cannot be directly used for this task.
We leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object).
arXiv Detail & Related papers (2023-11-28T18:55:37Z) - Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties [13.938281516499119]
We implement Emergent In-context Learning on Videos (EILEV), a novel training paradigm that induces in-context learning over video and text.
Our results, analysis, and EILEV-trained models yield numerous insights about the emergence of in-context learning over video and text.
arXiv Detail & Related papers (2023-11-28T18:53:06Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering [58.64831511644917]
We introduce an interpretable by design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system can improve 4.64% over a comparable black-box system in reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z) - Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning [80.59607794927363]
We propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM).
Unlike the widely used neural module networks in VQA, the task of collocating visual-linguistic modules is more challenging.
Experiments on the MS-COCO dataset show that our CVLNM is more effective and more robust, achieving a new state-of-the-art 129.5 CIDEr-D.
arXiv Detail & Related papers (2022-10-04T03:09:50Z) - Unpacking Large Language Models with Conceptual Consistency [14.224799628694592]
We propose conceptual consistency to measure a Large Language Model's understanding of relevant concepts.
This novel metric measures how well a model can be characterized by finding out how consistent its responses to queries about conceptually relevant background knowledge are.
arXiv Detail & Related papers (2022-09-29T20:55:57Z) - FALCON: Fast Visual Concept Learning by Integrating Images, Linguistic Descriptions, and Conceptual Relations [99.54048050189971]
We present a framework for learning new visual concepts quickly, guided by multiple naturally occurring data streams.
The learned concepts support downstream applications, such as answering questions by reasoning about unseen images.
We demonstrate the effectiveness of our model on both synthetic and real-world datasets.
arXiv Detail & Related papers (2022-03-30T19:45:00Z) - Object-Centric Diagnosis of Visual Reasoning [118.36750454795428]
This paper presents a systematical object-centric diagnosis of visual reasoning on grounding and robustness.
We develop a diagnostic model, namely Graph Reasoning Machine.
Our model replaces purely symbolic visual representation with probabilistic scene graph and then applies teacher-forcing training for the visual reasoning module.
arXiv Detail & Related papers (2020-12-21T18:59:28Z) - Transformation Driven Visual Reasoning [80.32402545546209]
This paper defines a new visual reasoning paradigm by introducing an important factor, i.e., transformation.
We argue that this kind of state-driven visual reasoning approach has limitations in reflecting whether the machine has the ability to infer the dynamics between different states.
Experimental results show that the state-of-the-art visual reasoning models perform well on Basic, but are still far from human-level intelligence on Event and View.
arXiv Detail & Related papers (2020-11-26T07:11:31Z) - Assisting Scene Graph Generation with Self-Supervision [21.89909688056478]
We propose a set of three novel yet simple self-supervision tasks and train them as auxiliary multi-tasks to the main model.
When we train the base model from scratch with these self-supervision tasks, we achieve state-of-the-art results in all the metrics and recall settings.
arXiv Detail & Related papers (2020-08-08T16:38:03Z) - Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [93.08109196909763]
We propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN)
It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively.
It then introduces three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities.
arXiv Detail & Related papers (2020-03-31T05:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.