When and why vision-language models behave like bags-of-words, and what to do about it?
- URL: http://arxiv.org/abs/2210.01936v3
- Date: Thu, 23 Mar 2023 23:21:38 GMT
- Title: When and why vision-language models behave like bags-of-words, and what to do about it?
- Authors: Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou
- Abstract summary: We create the Attribution, Relation, and Order benchmark to evaluate the ability of VLMs to understand different types of relationships, attributes, and order.
ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases.
We show that state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and exhibit a severe lack of order sensitivity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the success of large vision and language models (VLMs) in many
downstream applications, it is unclear how well they encode compositional
information. Here, we create the Attribution, Relation, and Order (ARO)
benchmark to systematically evaluate the ability of VLMs to understand
different types of relationships, attributes, and order. ARO consists of Visual
Genome Attribution, to test the understanding of objects' properties; Visual
Genome Relation, to test for relational understanding; and COCO &
Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude
larger than previous benchmarks of compositionality, with more than 50,000 test
cases. We show that state-of-the-art VLMs have poor relational understanding,
can blunder when linking objects to their attributes, and exhibit a severe
lack of order sensitivity. VLMs are predominantly trained and evaluated on
large datasets with rich compositional structure in the images and captions.
Yet, training on these datasets has not been enough to address the lack of
compositional understanding, and evaluating on these datasets has failed to
surface this deficiency. To understand why these limitations emerge and are not
represented in the standard tests, we zoom into the evaluation and training
procedures. We demonstrate that it is possible to perform well on retrieval
over existing datasets without using the composition and order information.
Given that contrastive pretraining optimizes for retrieval on datasets with
similar shortcuts, we hypothesize that this can explain why the models do not
need to learn to represent compositional information. This finding suggests a
natural solution: composition-aware hard negative mining. We show that a
simple-to-implement modification of contrastive learning significantly improves
the performance on tasks requiring understanding of order and compositionality.
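To make the proposed fix concrete, below is a minimal sketch of composition-aware hard negative mining in a CLIP-style contrastive objective. It illustrates the idea described in the abstract, not the authors' actual implementation; the function names (`shuffle_caption`, `clip_loss_with_hard_negatives`) and the temperature value are assumptions.

```python
# Sketch only: composition-aware hard negatives for contrastive
# image-text training. Assumes the encoders already produce
# L2-normalized embeddings; names and defaults are illustrative.
import random

import torch
import torch.nn.functional as F


def shuffle_caption(caption: str, rng: random.Random) -> str:
    # Order-perturbed hard negative: same bag of words, different order.
    words = caption.split()
    rng.shuffle(words)
    return " ".join(words)


def clip_loss_with_hard_negatives(img_emb, txt_emb, neg_txt_emb, tau=0.07):
    """Symmetric InfoNCE loss with one composition-aware hard negative
    per image appended to the text candidates.

    img_emb:     (N, D) L2-normalized image embeddings
    txt_emb:     (N, D) L2-normalized embeddings of the true captions
    neg_txt_emb: (N, D) L2-normalized embeddings of shuffled captions
    """
    n = img_emb.size(0)
    # Text candidates: N true captions followed by their N shuffled negatives.
    all_txt = torch.cat([txt_emb, neg_txt_emb], dim=0)   # (2N, D)
    logits_i2t = img_emb @ all_txt.t() / tau             # (N, 2N)
    logits_t2i = txt_emb @ img_emb.t() / tau             # (N, N)
    targets = torch.arange(n, device=img_emb.device)
    # The image-to-text term must rank the true caption above both the
    # other captions in the batch and its own order-shuffled negative.
    return 0.5 * (F.cross_entropy(logits_i2t, targets)
                  + F.cross_entropy(logits_t2i, targets))
```

A shuffled caption contains exactly the same words as the original (for example, "the horse is eating the grass" can become "the grass is eating the horse"), so the image-to-text term can only be minimized by a text encoder that is sensitive to word order and composition, which a bag-of-words representation cannot achieve.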
Related papers
- MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models [85.10375181040436]
We propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating Vision-Language Models.
We find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons.
arXiv Detail & Related papers (2024-10-13T05:35:09Z)
- FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding.
We have established a new REC dataset characterized by two key features.
It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z)
- In-Context Learning Improves Compositional Understanding of Vision-Language Models [2.762909189433944]
Compositional image understanding remains a difficult task due to the object bias present in training data.
We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses.
Our proposed approach outperforms baseline models across multiple compositional understanding datasets.
arXiv Detail & Related papers (2024-07-22T09:03:29Z)
- How Deep Networks Learn Sparse and Hierarchical Data: the Sparse Random Hierarchy Model [4.215221129670858]
We show that by introducing sparsity to generative hierarchical models of data, the task acquires insensitivity to spatial transformations that are discrete versions of smooth transformations.
We quantify how the sample complexity of CNNs learning the SRHM depends on both the sparsity and hierarchical structure of the task.
arXiv Detail & Related papers (2024-04-16T17:01:27Z) - What Makes for Good Visual Instructions? Synthesizing Complex Visual
Reasoning Instructions for Visual Instruction Tuning [115.19451843294154]
Visual instruction tuning is an essential approach to improving the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs)
We propose a systematic approach to automatically creating high-quality complex visual reasoning instructions.
Our dataset consistently enhances the performance of all the compared MLLMs, e.g., improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and 28.8%, respectively.
arXiv Detail & Related papers (2023-11-02T15:36:12Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces.
We propose Multi-Granularity Alignment architecture for Visual Question Answering task (MGA-VQA)
Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework well preserves the relations between samples.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z) - Probing Linguistic Features of Sentence-Level Representations in Neural
Relation Extraction [80.38130122127882]
We introduce 14 probing tasks targeting linguistic properties relevant to neural relation extraction (RE)
We use them to study representations learned by more than 40 different encoder architecture and linguistic feature combinations trained on two datasets.
We find that the bias induced by the architecture and the inclusion of linguistic features are clearly expressed in the probing task performance.
arXiv Detail & Related papers (2020-04-17T09:17:40Z)