Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition
- URL: http://arxiv.org/abs/2406.09388v1
- Date: Thu, 13 Jun 2024 17:58:39 GMT
- Title: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition
- Authors: Youngtaek Oh, Pyunghwan Ahn, Jinhyung Kim, Gwangmo Song, Soonyoung Lee, In So Kweon, Junmo Kim
- Abstract summary: Vision and language models (VLMs) have showcased remarkable zero-shot recognition abilities.
But they face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text alignment.
This paper explores the intricate relationship between compositionality and recognition.
- Score: 61.956088652094515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot recognition abilities yet face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text alignment. This paper explores the intricate relationship between compositionality and recognition -- two pivotal aspects of VLM capability. We conduct a comprehensive evaluation of existing VLMs, covering both pre-training approaches aimed at recognition and fine-tuning methods designed to improve compositionality. Our evaluation employs 12 benchmarks for compositionality, along with 21 zero-shot classification and two retrieval benchmarks for recognition. In our analysis of 274 CLIP model checkpoints, we reveal patterns and trade-offs that emerge between compositional understanding and recognition accuracy. Ultimately, this necessitates strategic efforts towards developing models that improve both capabilities, as well as the meticulous formulation of benchmarks for compositionality. We release our evaluation framework at https://github.com/ytaek-oh/vl_compo.
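The recognition side of such an evaluation typically follows the standard CLIP zero-shot classification protocol: class names are embedded through text prompts, and each image is assigned to the nearest text embedding in the shared space. Below is a minimal sketch of that protocol using the open_clip library; the checkpoint name, class names, and image path are illustrative placeholders, not the paper's actual configuration or evaluation code.

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP checkpoint (placeholder model/pretraining tag, not the paper's exact setup).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Build text prompts from class names (placeholder classes).
class_names = ["cat", "dog", "airplane"]
text_tokens = tokenizer([f"a photo of a {c}" for c in class_names])

# Preprocess a single image (placeholder path).
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    # Encode both modalities and L2-normalize so cosine similarity reduces to a dot product.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Image-text similarity logits; the argmax over class prompts is the zero-shot prediction.
    logits = 100.0 * image_features @ text_features.T
    pred = logits.argmax(dim=-1).item()

print(f"Predicted class: {class_names[pred]}")
```

Compositionality benchmarks such as those referenced in the abstract instead score whether the model ranks the correct caption above a minimally perturbed (e.g., word-swapped) caption for the same image, using the same image-text similarity machinery.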
Related papers
- MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models [85.10375181040436]
We propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating Vision-Language Models.
We find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons.
arXiv Detail & Related papers (2024-10-13T05:35:09Z) - LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification [5.8754760054410955]
We introduce Hi-CoDecomposition, a novel framework designed to enhance model interpretability through structured concept analysis.
Our approach not only aligns with the performance of state-of-the-art models but also advances transparency by providing clear insights into the decision-making process.
arXiv Detail & Related papers (2024-05-29T00:36:56Z) - Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View [26.52297849056656]
Vision Language Models (VLMs) surprisingly lack sufficient knowledge with respect to compositional reasoning.
We propose evaluation methods from a novel game-theoretic view to assess the vulnerability of VLMs on different aspects of compositional understanding.
arXiv Detail & Related papers (2024-05-27T14:22:03Z) - Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596]
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach exploits the potential of joint vision-language anomaly detection and achieves performance comparable to current SOTA methods across various datasets.
arXiv Detail & Related papers (2024-05-08T03:13:20Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP [57.53087077735303]
We introduce SDS-CLIP, a lightweight and sample-efficient distillation method to enhance CLIP's compositional visio-linguistic reasoning.
Our approach fine-tunes CLIP using a distillation objective borrowed from large text-to-image generative models such as Stable Diffusion.
On the challenging Winoground benchmark, SDS-CLIP improves the visio-linguistic performance of various CLIP models by up to 7%, while on the ARO dataset, it boosts performance by up to 3%.
arXiv Detail & Related papers (2023-07-18T13:10:11Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Visual-Semantic Embedding Model Informed by Structured Knowledge [3.2734466030053175]
We propose a novel approach to improve a visual-semantic embedding model by incorporating concept representations captured from an external structured knowledge base.
We investigate its performance on image classification under both standard and zero-shot settings.
arXiv Detail & Related papers (2020-09-21T17:04:32Z)