Visual Commonsense in Pretrained Unimodal and Multimodal Models
- URL: http://arxiv.org/abs/2205.01850v1
- Date: Wed, 4 May 2022 02:07:55 GMT
- Title: Visual Commonsense in Pretrained Unimodal and Multimodal Models
- Authors: Chenyu Zhang, Benjamin Van Durme, Zhuowan Li, Elias Stengel-Eskin
- Abstract summary: We investigate to what degree unimodal (language-only) and multimodal (image and language) models capture a broad range of visually salient attributes.
We create the Visual Commonsense Tests dataset covering 5 property types (color, shape, material, size, and visual co-occurrence) for over 5000 subjects.
We then use our dataset to evaluate pretrained unimodal models and multimodal models.
- Score: 29.462625570767123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our commonsense knowledge about objects includes their typical visual
attributes; we know that bananas are typically yellow or green, and not purple.
Text and image corpora, being subject to reporting bias, represent this
world-knowledge to varying degrees of faithfulness. In this paper, we
investigate to what degree unimodal (language-only) and multimodal (image and
language) models capture a broad range of visually salient attributes. To that
end, we create the Visual Commonsense Tests (ViComTe) dataset covering 5
property types (color, shape, material, size, and visual co-occurrence) for
over 5000 subjects. We validate this dataset by showing that our grounded color
data correlates much better than ungrounded text-only data with crowdsourced
color judgments provided by Paik et al. (2021). We then use our dataset to
evaluate pretrained unimodal models and multimodal models. Our results indicate
that multimodal models better reconstruct attribute distributions, but are
still subject to reporting bias. Moreover, increasing model size does not
enhance performance, suggesting that the key to visual commonsense lies in the
data.
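The evaluation the abstract describes has two measurable parts: eliciting an attribute distribution from a pretrained model and correlating it with reference judgments. Below is a minimal sketch of one common way to do this for color with a masked language model and a Spearman rank correlation; the template, color vocabulary, and reference distribution are illustrative assumptions, not the ViComTe protocol.

```python
# Cloze-style color probing of a masked LM (illustrative, not the paper's exact setup).
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

COLORS = ["red", "orange", "yellow", "green", "blue", "purple", "brown", "black", "white"]

def color_distribution(subject: str) -> list[float]:
    """Probability the model assigns to each color word in a cloze template."""
    text = f"Most {subject}s are [MASK]."  # assumed template
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = logits.softmax(dim=-1)
    scores = probs[[tokenizer.convert_tokens_to_ids(c) for c in COLORS]]
    return (scores / scores.sum()).tolist()  # renormalize over the color set

# Hypothetical reference distribution for "banana" (mostly yellow, some green).
reference = [0.0, 0.0, 0.70, 0.25, 0.0, 0.0, 0.05, 0.0, 0.0]
rho, _ = spearmanr(color_distribution("banana"), reference)
print(f"Spearman correlation with reference judgments: {rho:.2f}")
```

Comparing such correlations for grounded versus text-only reference data is what lets the paper argue that its grounded color data better matches the crowdsourced judgments of Paik et al. (2021).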
Related papers
- What to do if language models disagree? Black-box model ensembling for textual and visual question answering [2.1439084103679273]
We introduce InfoSel, a data-efficient and lightweight ensemble method that learns to pick the winner from existing black-box models.
We show that our approach achieves an absolute increase of up to +5.27% in the F1-score compared to standalone LLMs.
arXiv Detail & Related papers (2024-07-04T12:59:10Z)
- Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models [0.09264362806173355]
Large Language and Vision Assistant models (LLVAs) engage users in rich conversational experiences intertwined with image-based queries.
This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts.
Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images.
arXiv Detail & Related papers (2023-12-30T03:19:54Z)
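The tailored-prompt classification summarized in the entry above reduces to building a constrained prompt, querying the model, and mapping the free-text reply back onto a label set. A minimal sketch under that assumption, with a hypothetical generate(image_path, prompt) callable standing in for whatever chat interface the model exposes:

```python
# Prompt-based zero-shot image classification with a large multimodal model
# (the callable, label sets, and prompt wording are illustrative assumptions).
def classify(generate, image_path: str, labels: list[str]) -> str:
    """Ask the model to pick one label for the image and map its reply back."""
    prompt = (
        "Look at the image and answer with exactly one word from this list: "
        + ", ".join(labels) + "."
    )
    reply = generate(image_path, prompt).strip().lower()
    # Fall back to the first label whose name appears in the reply.
    for label in labels:
        if label.lower() in reply:
            return label
    return "unknown"

# Example: binary cats-vs-dogs classification.
# prediction = classify(my_lmm_generate, "photo.jpg", ["cat", "dog"])
```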
"Wan Juan" is a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources.
It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale.
arXiv Detail & Related papers (2023-08-21T14:40:48Z)
- Confidence-based Ensembles of End-to-End Speech Recognition Models [71.65982591023581]
We show that a confidence-based ensemble of 5 monolingual models outperforms a system where model selection is performed via a dedicated language identification block.
We also demonstrate that it is possible to combine base and adapted models to achieve strong results on both original and target data.
arXiv Detail & Related papers (2023-06-27T23:13:43Z)
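The selection rule behind the confidence-based ensemble summarized above can be sketched in a few lines: every model produces a hypothesis with a confidence score, and the ensemble keeps the most confident one. The tuple layout and callable interface below are illustrative assumptions, not the paper's implementation.

```python
# Confidence-based selection over black-box recognizers (illustrative interface).
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (transcript, confidence in [0, 1])

def confidence_ensemble(models: List[Callable[[bytes], Hypothesis]], audio: bytes) -> str:
    """Run every model on the audio and keep the transcript of the most confident one."""
    hypotheses = [m(audio) for m in models]
    best_transcript, _ = max(hypotheses, key=lambda h: h[1])
    return best_transcript
```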
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
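The projection step described in the entry above has a simple linear-algebra core: build P = I - VVᵀ, where the columns of V are an orthonormal basis of the bias subspace, and apply P to every text embedding. The sketch below shows only that core on random stand-in data; the calibration the authors add is omitted, and the variable names are assumptions.

```python
# Projecting biased directions out of text embeddings (core idea only).
import numpy as np

def orthogonal_projection(bias_directions: np.ndarray) -> np.ndarray:
    """P = I - V V^T for V an orthonormal basis of the rows of bias_directions."""
    v, _ = np.linalg.qr(bias_directions.T)  # orthonormalize the bias directions
    return np.eye(bias_directions.shape[1]) - v @ v.T

def debias(text_embeddings: np.ndarray, bias_directions: np.ndarray) -> np.ndarray:
    """Remove the bias-subspace component from each row embedding."""
    return text_embeddings @ orthogonal_projection(bias_directions)

# Random data standing in for CLIP-style text embeddings and two biased directions.
rng = np.random.default_rng(0)
emb, bias = rng.normal(size=(8, 512)), rng.normal(size=(2, 512))
print(np.abs(debias(emb, bias) @ bias.T).max())  # ~0: nothing left along the bias directions
```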
- Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images.
We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn.
We extensively study and validate our model performance on over 50 benchmarks, including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets.
arXiv Detail & Related papers (2022-02-16T22:26:47Z)
- Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions.
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
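A common way to quantify how much a model relies on a modality, in the spirit of the perceptual score summarized above, is to permute that modality across the batch and measure the accuracy drop. The sketch below assumes a hypothetical VQA-style model(images, questions) callable and omits the score's exact normalization.

```python
# Modality-reliance probe via cross-example permutation (simplified).
import torch

def image_reliance(model, images, questions, labels) -> float:
    """Accuracy with intact inputs minus accuracy with images shuffled across examples."""
    with torch.no_grad():
        base_acc = (model(images, questions).argmax(-1) == labels).float().mean()
        perm = torch.randperm(images.shape[0])
        perm_acc = (model(images[perm], questions).argmax(-1) == labels).float().mean()
    return (base_acc - perm_acc).item()
```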
- The World of an Octopus: How Reporting Bias Influences a Language Model's Perception of Color [73.70233477125781]
We show that reporting bias negatively impacts and inherently limits text-only training.
We then demonstrate that multimodal models can leverage their visual training to mitigate these effects.
arXiv Detail & Related papers (2021-10-15T16:28:17Z)
- Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [66.15398165275926]
We propose a method that can automatically detect and ignore dataset-specific patterns, which we call dataset biases.
Our method trains a lower capacity model in an ensemble with a higher capacity model.
We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
arXiv Detail & Related papers (2020-11-07T22:20:03Z)
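The lower-capacity/higher-capacity pairing summarized above is typically trained as a product of experts: the two models' log-probabilities are summed before the cross-entropy loss, which pushes the main model to explain what the biased low-capacity model cannot. Whether this matches the authors' exact formulation is an assumption; the sketch only shows the combined loss.

```python
# Product-of-experts training loss for a main model and a low-capacity bias model.
import torch
import torch.nn.functional as F

def poe_loss(main_logits: torch.Tensor,
             bias_logits: torch.Tensor,
             labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the summed log-probabilities of the two models."""
    combined = F.log_softmax(main_logits, dim=-1) + F.log_softmax(bias_logits, dim=-1)
    return F.cross_entropy(combined, labels)  # at test time, only main_logits are used
```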