CREPE: Can Vision-Language Foundation Models Reason Compositionally?
- URL: http://arxiv.org/abs/2212.07796v3
- Date: Tue, 16 May 2023 16:27:08 GMT
- Title: CREPE: Can Vision-Language Foundation Models Reason Compositionally?
- Authors: Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao,
Ranjay Krishna
- Abstract summary: We introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity.
For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set.
For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity.
- Score: 10.958279688917434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A fundamental characteristic common to both human vision and natural language
is their compositional nature. Yet, despite the performance gains contributed
by large vision and language pretraining, we find that models spanning 7
architectures, trained with 4 algorithms on massive datasets, still struggle at
compositionality. To arrive at this conclusion, we introduce a new
compositionality evaluation benchmark, CREPE, which measures two important
aspects of compositionality identified by cognitive science literature:
systematicity and productivity. To measure systematicity, CREPE consists of a
test dataset containing over $370K$ image-text pairs and three different
seen-unseen splits. The three splits are designed to test models trained on
three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also
generate $325K$, $316K$, and $309K$ hard negative captions for a subset of the
pairs. To test productivity, CREPE contains $17K$ image-text pairs with nine
different complexities plus $183K$ hard negative captions with atomic, swapping
and negation foils. The datasets are generated by repurposing the Visual Genome
scene graphs and region descriptions and applying handcrafted templates and
GPT-3. For systematicity, we find that model performance decreases consistently
when novel compositions dominate the retrieval set, with Recall@1 dropping by
up to $12\%$. For productivity, models' retrieval success decays as complexity
increases, frequently nearing random chance at high complexity. These results
hold regardless of model and training dataset size.
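To make the retrieval protocol concrete, here is a minimal sketch (not the released CREPE evaluation code) of scoring image-to-text retrieval with hard negatives: each image is ranked against its ground-truth caption plus a set of foils (e.g. atomic, swapping, or negation negatives), and Recall@1 counts how often the true caption ranks first. The array names and shapes are illustrative assumptions; the embeddings stand in for the outputs of any CLIP-like encoder.
```python
import numpy as np

def recall_at_1(image_embs, caption_sets):
    """Score image-to-text retrieval against hard negatives.

    image_embs   : (N, D) array of L2-normalized image embeddings.
    caption_sets : length-N list of (true_emb, negative_embs) pairs, where
                   true_emb has shape (D,) and negative_embs has shape (K, D),
                   e.g. K atomic / swapping / negation foils per image.
    Returns the fraction of images whose ground-truth caption outranks
    every hard negative (Recall@1 over the candidate set).
    """
    hits = 0
    for img_emb, (pos_emb, neg_embs) in zip(image_embs, caption_sets):
        candidates = np.vstack([pos_emb[None, :], neg_embs])  # true caption sits at index 0
        scores = candidates @ img_emb                          # cosine similarity (unit-norm inputs)
        hits += int(np.argmax(scores) == 0)
    return hits / len(image_embs)
```
Comparing this score between retrieval sets dominated by seen versus unseen compositions gives the systematicity gap reported in the abstract.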
Related papers
- Can Models Learn Skill Composition from Examples? [50.5142714905768]
We evaluate the capacity of smaller models to learn compositional generalization from examples.
We show that training on combinations of $k=2$ and $3$ skills results in noticeable improvements in the ability to compose texts.
This study also suggests that incorporating skill-rich (potentially synthetic) text into training can substantially enhance the compositional capabilities of models.
arXiv Detail & Related papers (2024-09-29T22:14:02Z)
- An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set [0.0]
Under default settings, Human-Object Interaction (HOI) performance is nearly saturated.
This study uses two experimental settings: ground truth and random arbitrary combinations.
We find that the open vocabulary capabilities of the multimodal visual foundation model are not yet fully realized.
arXiv Detail & Related papers (2024-08-11T13:40:02Z)
- $\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs [62.565573316667276]
We develop an objective that encodes how a sample relates to others.
We train vision models using similarities derived from class labels or text caption descriptions.
Our objective appears to work particularly well in lower-data regimes, with gains over CLIP of $16.8\%$ on ImageNet and $18.1\%$ on ImageNet Real.
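A rough sketch of the idea summarized above, assuming a CLIP-style two-tower setup: replace the one-hot targets of the standard contrastive loss with a row-normalized sample-similarity graph, so each sample is softly attracted to related samples. The function name, temperature default, and weighting choices are illustrative assumptions, not the paper's released implementation.
```python
import torch
import torch.nn.functional as F

def graph_contrastive_loss(image_feats, text_feats, sample_sim, temperature=0.07):
    """Contrastive loss with soft, graph-valued targets instead of one-hot pairs.

    image_feats, text_feats : (B, D) embeddings from the two encoders.
    sample_sim              : (B, B) non-negative similarity graph between the B
                              samples (e.g. from shared class labels or caption
                              similarity); each row is normalized into a soft
                              target distribution.
    With sample_sim = identity, this reduces to a one-directional CLIP/InfoNCE term.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature           # (B, B) similarity logits
    targets = sample_sim / sample_sim.sum(dim=-1, keepdim=True)   # row-normalized soft targets
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```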
arXiv Detail & Related papers (2024-07-25T15:38:16Z)
- CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples [34.71588837946776]
We propose CounterCurate, a framework to improve visio-linguistic compositional reasoning.
In particular, we identify two critical under-explored problems, including the neglect of physically grounded reasoning.
We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning.
We then apply simple data augmentation using the grounded image-generation model GLIGEN to generate fine-tuning data, resulting in significant performance improvements.
arXiv Detail & Related papers (2024-02-20T18:59:55Z)
- Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation [30.79358827005448]
Scene Graph Generation (SGG) aims to structurally and comprehensively represent objects and their connections in images.
Existing SGG models often struggle to solve the long-tailed problem caused by biased datasets.
We propose a Text-Image-joint Scene Graph Generation (TISGG) model to handle unseen triples and improve the generalisation capability of SGG models.
arXiv Detail & Related papers (2023-06-23T10:17:56Z)
- Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [63.88319217738223]
We present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks.
Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training.
Our results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild.
arXiv Detail & Related papers (2023-05-22T17:59:43Z)
- On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of large pretrained language models on semantic parsing tasks with in-context learning.
We evaluate four model families (OPT, BLOOM, CodeGen, and Codex) on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
- When and why vision-language models behave like bags-of-words, and what to do about it? [39.90099818890488]
We create the Attribution, Relation, and Order benchmark to evaluate the ability of VLMs to understand different types of relationships, attributes, and order.
ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases.
We show that state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and exhibit a severe lack of order sensitivity.
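As an illustration of the order-sensitivity probe described above (a sketch, not the benchmark's actual generation code): shuffle a caption's words so the bag of words is unchanged while the composition is destroyed, then check whether a model ranks the original caption above its shuffled foils.
```python
import random

def order_foils(caption, num_foils=4, seed=0):
    """Create bag-of-words-preserving hard negatives by shuffling word order.

    A model that represents captions as bags of words assigns the original
    caption and every foil the same score, so ranking the original first
    is a direct probe of order sensitivity.
    """
    rng = random.Random(seed)
    words = caption.split()
    foils, attempts = set(), 0
    while len(foils) < num_foils and attempts < 100:
        shuffled = words[:]
        rng.shuffle(shuffled)
        foil = " ".join(shuffled)
        if foil != caption:
            foils.add(foil)
        attempts += 1
    return sorted(foils)

print(order_foils("the brown horse is eating the green grass"))
```
The same ranking test applies to attribute-swap foils, where two objects' attributes are exchanged while the set of words is kept fixed.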
arXiv Detail & Related papers (2022-10-04T22:13:25Z)
- Semantic Compositional Learning for Low-shot Scene Graph Generation [122.51930904132685]
Many scene graph generation (SGG) models use only the limited annotated relation triples for training.
We propose a novel semantic compositional learning strategy that makes it possible to construct additional, realistic relation triples.
For three recent SGG models, adding our strategy improves their performance by close to 50%, and all of them substantially exceed the current state of the art.
arXiv Detail & Related papers (2021-08-19T10:13:55Z)
- Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
arXiv Detail & Related papers (2020-05-28T17:29:03Z)
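To make the few-shot setting concrete, a minimal sketch of how a k-shot prompt is assembled for an autoregressive model: the task is specified entirely through in-context demonstrations, with no gradient updates. The translation format mirrors the demonstrations popularized by the GPT-3 paper; the helper function itself is an illustrative assumption.
```python
def few_shot_prompt(examples, query, task="Translate English to French:"):
    """Assemble a k-shot prompt: a task description, k solved demonstrations,
    then the unsolved query. The model completes the last line in context."""
    demos = [f"{src} => {tgt}" for src, tgt in examples]
    return "\n".join([task, *demos, f"{query} =>"])

print(few_shot_prompt(
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
))
```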