Benchmarking VLMs' Reasoning About Persuasive Atypical Images
- URL: http://arxiv.org/abs/2409.10719v2
- Date: Thu, 10 Oct 2024 14:47:58 GMT
- Title: Benchmarking VLMs' Reasoning About Persuasive Atypical Images
- Authors: Sina Malakouti, Aysan Aghazadeh, Ashmit Khandelwal, Adriana Kovashka
- Abstract summary: Vision language models (VLMs) have shown strong zero-shot generalization across various tasks.
Their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied.
We introduce three novel tasks to benchmark VLMs' understanding of atypicality in persuasive images.
- Score: 31.944810096834104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision language models (VLMs) have shown strong zero-shot generalization across various tasks, especially when integrated with large language models (LLMs). However, their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied. Ads often employ atypical imagery, using surprising object juxtapositions to convey shared properties. For example, Fig. 1 (e) shows a beer with a feather-like texture. This requires advanced reasoning to deduce that this atypical representation signifies the beer's lightness. We introduce three novel tasks, Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition, to benchmark VLMs' understanding of atypicality in persuasive images. We evaluate how well VLMs use atypicality to infer an ad's message and test their reasoning abilities by employing semantically challenging negatives. Finally, we pioneer atypicality-aware verbalization by extracting comprehensive image descriptions sensitive to atypical elements. Our findings reveal that: (1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple, effective strategies can extract atypicality-aware information, leading to comprehensive image verbalization; (3) atypicality aids persuasive advertisement understanding. Code and data will be made available.
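To make the benchmark tasks concrete, below is a minimal sketch of how predictions for the Multi-label Atypicality Classification task could be scored. The atypicality category names, file names, and the per-image F1 metric are illustrative assumptions, not the paper's actual taxonomy, prompts, or evaluation protocol.

```python
# Minimal sketch of scoring a VLM on multi-label atypicality classification.
# The category names and example predictions below are hypothetical.

from typing import Dict, Set

# Hypothetical atypicality categories an ad image could exhibit.
ATYPICALITY_LABELS = [
    "texture_replacement",   # e.g., a beer rendered with a feather-like texture
    "object_juxtaposition",  # two unrelated objects combined into one
    "scale_distortion",      # an object shown at an implausible size
]

def per_image_f1(pred: Set[str], gold: Set[str]) -> float:
    """F1 between the predicted and gold label sets for a single image."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def evaluate(predictions: Dict[str, Set[str]], gold: Dict[str, Set[str]]) -> float:
    """Average per-image F1 over the benchmark (example-based multi-label scoring)."""
    scores = [per_image_f1(predictions.get(img, set()), labels)
              for img, labels in gold.items()]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Toy label sets, as if parsed from a VLM's free-form answers per ad image.
    gold = {"ad_001.jpg": {"texture_replacement"},
            "ad_002.jpg": {"object_juxtaposition", "scale_distortion"}}
    preds = {"ad_001.jpg": {"texture_replacement"},
             "ad_002.jpg": {"object_juxtaposition"}}
    print(f"mean per-image F1: {evaluate(preds, gold):.3f}")
```

In practice, the predicted label sets would be parsed from each VLM's generated answer per ad image, and the retrieval and object-recognition tasks would need their own metrics (e.g., recall@k or accuracy).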
Related papers
- Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images [19.923665989164387]
We propose a novel Multimodal Causal Reasoning benchmark, namely MuCR, to challenge vision large language models (VLLMs).
Specifically, we introduce a prompt-driven image synthesis approach to create siamese images with embedded semantic causality and visual cues.
Our extensive experiments reveal that the current state-of-the-art VLLMs are not as skilled at multimodal causal reasoning as we might have hoped.
arXiv Detail & Related papers (2024-08-15T12:04:32Z) - If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions [9.190831897944957]
Vision-Language Model (VLM) representations are often based on visual attributes like shape.
We propose Extract and Explore (EX2), a novel approach to characterize important textual features.
We show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
arXiv Detail & Related papers (2024-03-25T06:05:50Z) - How Far Are We from Intelligent Visual Deductive Reasoning? [41.4377002379162]
We dig into vision-based deductive reasoning, a more sophisticated but less explored realm.
We find previously unexposed blindspots in the current SOTA VLMs.
We find that certain standard strategies that are effective when applied to LLMs do not seamlessly translate to the challenges presented by visual reasoning tasks.
arXiv Detail & Related papers (2024-03-07T18:35:54Z) - Context Disentangling and Prototype Inheriting for Robust Visual Grounding [56.63007386345772]
Visual grounding (VG) aims to locate a specific target in an image based on a given language query.
We propose a novel framework with context disentangling and prototype inheriting for robust visual grounding that handles both standard and open-vocabulary scenes.
Our method outperforms the state-of-the-art methods in both scenarios.
arXiv Detail & Related papers (2023-12-19T09:03:53Z) - Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z) - KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models [40.54372699488922]
We perform the first empirical study of image ad understanding through the lens of pre-trained vision-language models (VLMs).
We propose a simple feature adaptation strategy to effectively fuse multimodal information for image ads and further empower it with knowledge of real-world entities.
arXiv Detail & Related papers (2023-05-28T04:49:01Z) - Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z) - SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z) - Unifying Vision-Language Representation Space with Single-tower Transformer [29.604520441315135]
We train a single-tower model, OneR, to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner.
We discover intriguing properties that distinguish OneR from the previous works that learn modality-specific representation spaces.
arXiv Detail & Related papers (2022-11-21T02:34:21Z) - CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning [78.3857991931479]
We present GROLLA, an evaluation framework for Grounded Language Learning with Attributes.
We also propose a new dataset CompGuessWhat?! as an instance of this framework for evaluating the quality of learned neural representations.
arXiv Detail & Related papers (2020-06-03T11:21:42Z) - Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.