Visual Data-Type Understanding does not emerge from Scaling
Vision-Language Models
- URL: http://arxiv.org/abs/2310.08577v3
- Date: Wed, 6 Dec 2023 12:34:16 GMT
- Title: Visual Data-Type Understanding does not emerge from Scaling
Vision-Language Models
- Authors: Vishaal Udandarao, Max F. Burg, Samuel Albanie, Matthias Bethge
- Abstract summary: We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
- Score: 31.69213233651326
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in the development of vision-language models (VLMs) are
yielding remarkable success in recognizing visual semantic content, including
impressive instances of compositional image understanding. Here, we introduce
the novel task of Visual Data-Type Identification, a basic perceptual skill
with implications for data curation (e.g., noisy data-removal from large
datasets, domain-specific retrieval) and autonomous vision (e.g.,
distinguishing changing weather conditions from camera lens staining). We
develop two datasets consisting of animal images altered across a diverse set
of 27 visual data-types, spanning four broad categories. An extensive zero-shot
evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced
performance landscape. While VLMs are reasonably good at identifying certain
stylistic \textit{data-types}, such as cartoons and sketches, they struggle
with simpler data-types arising from basic manipulations like image rotations
or additive noise. Our findings reveal that (i) model scaling alone yields
marginal gains for contrastively-trained models like CLIP, and (ii) there is a
pronounced drop in performance for the largest auto-regressively trained VLMs
like OpenFlamingo. This finding points to a blind spot in current frontier
VLMs: they excel in recognizing semantic content but fail to acquire an
understanding of visual data-types through scaling. By analyzing the
pre-training distributions of these models and incorporating data-type
information into the captions during fine-tuning, we achieve a significant
enhancement in performance. By exploring this previously uncharted task, we aim
to set the stage for further advancing VLMs to equip them with visual data-type
understanding. Code and datasets are released at
https://github.com/bethgelab/DataTypeIdentification.
Related papers
- Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images [7.823336661261962]
Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors.
We propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details.
arXiv Detail & Related papers (2025-02-19T18:05:42Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.
Our key idea involves challenging the model to discern both matching and distinct elements.
We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z) - How Well Can Vision Language Models See Image Details? [53.036922527685064]
We introduce a pixel value prediction task to explore "How Well Can Vision Language Models See Image Details?"
Our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks.
arXiv Detail & Related papers (2024-08-07T17:59:40Z) - VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin)
VisMin requires models to predict the correct image-caption match given two images and two captions.
We build an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators.
arXiv Detail & Related papers (2024-07-23T18:10:43Z) - BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM)
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID)
arXiv Detail & Related papers (2024-07-18T12:11:12Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Why are Visually-Grounded Language Models Bad at Image Classification? [39.76294811955341]
We revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA.
We find that existing proprietary and public VLMs significantly underperform CLIP on standard image classification benchmarks like ImageNet.
Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data.
arXiv Detail & Related papers (2024-05-28T17:57:06Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - Diversify Your Vision Datasets with Automatic Diffusion-Based
Augmentation [66.6546668043249]
ALIA (Automated Language-guided Image Augmentation) is a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains.
To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information.
We show that ALIA is able to surpasses traditional data augmentation and text-to-image generated data on fine-grained classification tasks.
arXiv Detail & Related papers (2023-05-25T17:43:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.