Related papers: Evaluating Vision-Language Models on Bistable Images

Evaluating Vision-Language Models on Bistable Images

URL: http://arxiv.org/abs/2405.19423v1
Date: Wed, 29 May 2024 18:04:59 GMT
Title: Evaluating Vision-Language Models on Bistable Images
Authors: Artemis Panagopoulou, Coby Melkin, Chris Callison-Burch,
Abstract summary: This study is the most extensive examination of vision-language models using bistable images to date. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another.
Score: 34.492117496933915
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Bistable images, also known as ambiguous or reversible images, present visual stimuli that can be seen in two distinct interpretations, though not simultaneously by the observer. In this study, we conduct the most extensive examination of vision-language models using bistable images to date. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation. We evaluated twelve different models in both classification and generative tasks across six model architectures. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another among the models, and minimal variance under image manipulations, with few exceptions on image rotations. Additionally, we compared the model preferences with humans, noting that the models do not exhibit the same continuity biases as humans and often diverge from human initial interpretations. We also investigated the influence of variations in prompts and the use of synonymous labels, discovering that these factors significantly affect model interpretations more than image manipulations showing a higher influence of the language priors on bistable image interpretations compared to image-text training data. All code and data is open sourced.

Related papers

Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models. We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
See or Guess: Counterfactually Regularized Image Captioning [32.82695612178604]
We present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable. Our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models.
arXiv Detail & Related papers (2024-08-29T17:59:57Z)
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images. Most existing methods directly capture the difference between them, which risk obtaining error-prone difference features. We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images. We identify model weaknesses by testing the model using the counterfactual image dataset. We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
Semantic and Expressive Variation in Image Captions Across Languages [26.766596770616655]
We study how people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli.<n>By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression.<n>Our work points towards the need to accounttuning for and embrace the diversity of human perception in the computer vision community.
arXiv Detail & Related papers (2023-10-22T16:51:42Z)
The Scenario Refiner: Grounding subjects in images at the morphological level [2.401993998791928]
We ask whether Vision and Language (V&L) models capture such distinctions at the morphological level. We compare the results from V&L models to human judgements and find that models' predictions differ from those of human participants.
arXiv Detail & Related papers (2023-09-20T12:23:06Z)
DeAR: Debiasing Vision-Language Models with Additive Residuals [5.672132510411465]
Large pre-trained vision-language models (VLMs) provide rich, adaptable image and text representations. These models suffer from societal biases owing to the skewed distribution of various identity groups in the training data. We present DeAR, a novel debiasing method that learns additive residual image representations to offset the original representations.
arXiv Detail & Related papers (2023-03-18T14:57:43Z)
Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models. Our empirical observations suggest that vision-and-language models are better at label prediction tasks. We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images. We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn. We extensively study and validate our model performance on over 50 benchmarks including fairness, to distribution shift, geographical diversity, fine grained recognition, image copy detection and many image classification datasets.
arXiv Detail & Related papers (2022-02-16T22:26:47Z)
Unsupervised Landmark Learning from Unpaired Data [117.81440795184587]
Recent attempts for unsupervised landmark learning leverage synthesized image pairs that are similar in appearance but different in poses. We propose a cross-image cycle consistency framework which applies the swapping-reconstruction strategy twice to obtain the final supervision. Our proposed framework is shown to outperform strong baselines by a large margin.
arXiv Detail & Related papers (2020-06-29T13:57:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.