Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
- URL: http://arxiv.org/abs/2211.04971v2
- Date: Thu, 10 Nov 2022 16:49:37 GMT
- Title: Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
- Authors: Michele Cafagna, Kees van Deemter, Albert Gatt
- Abstract summary: This paper explores the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level.
We show that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene.
We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.
- Score: 3.7957452405531256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning models tend to describe images in an object-centric way,
emphasising visible objects. But image descriptions can also abstract away from
objects and describe the type of scene depicted. In this paper, we explore the
potential of a state-of-the-art Vision and Language model, VinVL, to caption
images at the scene level using (1) a novel dataset which pairs images with
both object-centric and scene descriptions. Through (2) an in-depth analysis of
the effect of the fine-tuning, we show (3) that a small amount of curated data
suffices to generate scene descriptions without losing the capability to
identify object-level concepts in the scene; the model acquires a more holistic
view of the image compared to when object-centric descriptions are generated.
We discuss the parallels between these results and insights from computational
and cognitive science research on scene perception.
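As an illustration of the fine-tuning setup the abstract describes, the sketch below shows how a pre-trained object-centric captioner could be further trained on a handful of curated scene-level descriptions. It is a minimal Python sketch, not the authors' pipeline: it substitutes a generic Hugging Face ViT+GPT-2 captioner for VinVL (whose training code is not reproduced here), and the data file `scene_descriptions.json`, the image paths, and the hyperparameters are all hypothetical.
```python
# Minimal sketch: fine-tuning a pre-trained captioning model on scene-level
# descriptions. A generic ViT+GPT-2 captioner stands in for VinVL, and the
# JSON file of (image path, scene description) pairs is hypothetical.
import json

import torch
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

MODEL_NAME = "nlpconnect/vit-gpt2-image-captioning"  # stand-in, not VinVL

model = VisionEncoderDecoderModel.from_pretrained(MODEL_NAME)
processor = ViTImageProcessor.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Hypothetical curated dataset:
# [{"image": "img_001.jpg", "scene": "a busy street market at dusk"}, ...]
with open("scene_descriptions.json") as f:
    pairs = json.load(f)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(3):  # a small curated set -> a few epochs
    for example in pairs:
        image = Image.open(example["image"]).convert("RGB")
        pixel_values = processor(images=image, return_tensors="pt").pixel_values
        labels = tokenizer(example["scene"], return_tensors="pt").input_ids

        # The decoder is trained to generate the scene-level description
        # conditioned on the encoded image.
        outputs = model(pixel_values=pixel_values, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# After fine-tuning, the same model can still be asked for a caption,
# which is the sense in which object-level competence can be checked.
model.eval()
with torch.no_grad():
    pixel_values = processor(
        images=Image.open(pairs[0]["image"]).convert("RGB"), return_tensors="pt"
    ).pixel_values
    generated = model.generate(pixel_values, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```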
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- Semantically-aware Neural Radiance Fields for Visual Scene Understanding: A Comprehensive Review [26.436253160392123]
This review thoroughly examines the role of semantically-aware Neural Radiance Fields (NeRFs) in visual scene understanding.
NeRFs adeptly infer 3D representations for both stationary and dynamic objects in a scene.
arXiv Detail & Related papers (2024-02-17T00:15:09Z)
- Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions [4.026600887656479]
We investigate whether a state-of-the-art language and vision model, CLIP, is able to ground perspective descriptions of a 3D object.
We present an evaluation framework that uses a circling camera around a 3D object to generate images from different viewpoints.
We find that a pre-trained CLIP model performs poorly on most canonical views.
arXiv Detail & Related papers (2023-02-13T15:18:27Z)
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- What Vision-Language Models `See' when they See Scenes [5.027571997864707]
We compare three state-of-the-art Vision and Language models: VisualBERT, LXMERT and CLIP.
We find that (i) V&L models are susceptible to stylistic biases acquired during pretraining; (ii) only CLIP performs consistently well on both object- and scene-level descriptions.
A minimal sketch of this kind of CLIP-based object- versus scene-level scoring appears after this list.
arXiv Detail & Related papers (2021-09-15T13:57:39Z)
- Neural Scene Graphs for Dynamic Scenes [57.65413768984925]
We present the first neural rendering method that decomposes dynamic scenes into scene graphs.
We learn implicitly encoded scenes, combined with a jointly learned latent representation, to describe objects with a single implicit function.
arXiv Detail & Related papers (2020-11-20T12:37:10Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
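The comparison reported above in "What Vision-Language Models `See' when they See Scenes" suggests a simple way to probe how a contrastive model ranks object-level versus scene-level descriptions of the same image. The following is a minimal, hypothetical sketch using the public openai/clip-vit-base-patch32 checkpoint through Hugging Face; the image file and the two candidate captions are invented for illustration and are not drawn from any of the listed papers' data.
```python
# Minimal sketch: scoring an object-centric vs a scene-level description
# with CLIP. Image file and captions are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("kitchen.jpg")  # hypothetical image file
captions = [
    "a kettle and a mug on a wooden counter",  # object-centric description
    "the inside of a small domestic kitchen",  # scene-level description
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax makes them
# comparable across the two candidate descriptions.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```
A higher probability for one caption only indicates the model's relative preference on a single image; the papers listed above evaluate such preferences systematically over curated datasets.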
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.