What Vision-Language Models 'See' when they See Scenes
- URL: http://arxiv.org/abs/2109.07301v1
- Date: Wed, 15 Sep 2021 13:57:39 GMT
- Title: What Vision-Language Models 'See' when they See Scenes
- Authors: Michele Cafagna, Kees van Deemter and Albert Gatt
- Abstract summary: We compare 3 state-of-the-art Vision and Language models, VisualBERT, LXMERT and CLIP.
We find that (i) V&L models are susceptible to stylistic biases acquired during pretraining; (ii) only CLIP performs consistently well on both object- and scene-level descriptions.
- Score: 5.027571997864707
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Images can be described in terms of the objects they contain, or in terms of
the types of scene or place that they instantiate. In this paper we address to
what extent pretrained Vision and Language models can learn to align
descriptions of both types with images. We compare 3 state-of-the-art models,
VisualBERT, LXMERT and CLIP. We find that (i) V&L models are susceptible to
stylistic biases acquired during pretraining; (ii) only CLIP performs
consistently well on both object- and scene-level descriptions. A follow-up
ablation study shows that CLIP uses object-level information in the visual
modality to align with scene-level textual descriptions.
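As an illustration of the kind of alignment test the abstract describes, here is a minimal sketch (not the authors' code) that scores one image against an object-level and a scene-level caption with a pretrained CLIP model. The checkpoint name, image path, and captions are placeholder assumptions; the sketch only requires the HuggingFace `transformers` library and Pillow.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; any public CLIP checkpoint would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder image path
captions = [
    "a sink, a stove, a fridge and a counter",  # object-level description
    "a kitchen",                                # scene-level description
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; the softmax expresses
# a preference between the two description types for this image.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```

Comparing such scores across many images, description styles, and models (CLIP vs. VisualBERT vs. LXMERT) is, in spirit, the kind of evaluation the paper reports; the sketch above covers only the CLIP case.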
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks such as detection and segmentation (a minimal sketch of the EMA-teacher update appears after this list).
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z) - Semantically-Prompted Language Models Improve Visual Descriptions [12.267513953980092]
We propose V-GLOSS: Visual Glosses, a novel method for generating expressive visual descriptions.
We show that V-GLOSS improves visual descriptions and achieves strong results in the zero-shot setting on general and fine-grained image-classification datasets.
arXiv Detail & Related papers (2023-06-05T17:22:54Z) - CapText: Large Language Model-based Caption Generation From Image
Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models such as OSCAR-VinVL on this task, as measured by the CIDEr metric.
arXiv Detail & Related papers (2023-06-01T02:40:44Z) - Paparazzi: A Deep Dive into the Capabilities of Language and Vision
Models for Grounding Viewpoint Descriptions [4.026600887656479]
We investigate whether a state-of-the-art language and vision model, CLIP, is able to ground perspective descriptions of a 3D object.
We present an evaluation framework that uses a camera circling a 3D object to generate images from different viewpoints.
We find that a pre-trained CLIP model performs poorly on most canonical views.
arXiv Detail & Related papers (2023-02-13T15:18:27Z) - I2MVFormer: Large Language Model Generated Multi-View Document
Supervision for Zero-Shot Image Classification [108.83932812826521]
Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks.
Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification from LLM-generated class descriptions (class views).
I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
arXiv Detail & Related papers (2022-12-05T14:11:36Z) - Understanding Cross-modal Interactions in V&L Models that Generate Scene
Descriptions [3.7957452405531256]
This paper explores the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level.
We show that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene.
We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.
arXiv Detail & Related papers (2022-11-09T15:33:51Z) - Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
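The SILC entry above mentions distilling local image features from an exponential moving average (EMA) teacher. Below is a minimal, generic sketch of that EMA-teacher mechanism, not SILC's actual implementation; the momentum value and the use of plain PyTorch modules are illustrative assumptions.

```python
import copy

import torch


def make_ema_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """Create a frozen copy of the student to serve as the EMA teacher."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher


@torch.no_grad()
def ema_update(teacher: torch.nn.Module,
               student: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """Move each teacher parameter a small step toward the student's value."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```

After each optimizer step on the student, calling `ema_update(teacher, student)` keeps the teacher as a slowly moving average of the student; in a self-distillation setup the student is then trained to match the teacher's (local) features on the same image.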
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.