Parts of Speech-Grounded Subspaces in Vision-Language Models
- URL: http://arxiv.org/abs/2305.14053v2
- Date: Sun, 12 Nov 2023 20:17:47 GMT
- Title: Parts of Speech-Grounded Subspaces in Vision-Language Models
- Authors: James Oldfield, Christos Tzelepis, Yannis Panagakis, Mihalis A.
Nicolaou, Ioannis Patras
- Abstract summary: We propose to separate representations of different visual modalities in CLIP's joint vision-language space.
We learn subspaces capturing variability corresponding to a specific part of speech, while minimising variability to the rest.
We show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances.
- Score: 32.497303059356334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Latent image representations arising from vision-language models have proved
immensely useful for a variety of downstream tasks. However, their utility is
limited by their entanglement with respect to different visual attributes. For
instance, recent work has shown that CLIP image representations are often
biased toward specific visual properties (such as objects or actions) in an
unpredictable manner. In this paper, we propose to separate representations of
the different visual modalities in CLIP's joint vision-language space by
leveraging the association between parts of speech and specific visual modes of
variation (e.g. nouns relate to objects, adjectives describe appearance). This
is achieved by formulating an appropriate component analysis model that learns
subspaces capturing variability corresponding to a specific part of speech,
while jointly minimising variability to the rest. Such a subspace yields
disentangled representations of the different visual properties of an image or
text in closed form while respecting the underlying geometry of the manifold on
which the representations lie. What's more, we show the proposed model
additionally facilitates learning subspaces corresponding to specific visual
appearances (e.g. artists' painting styles), which enables the selective
removal of entire visual themes from CLIP-based text-to-image synthesis. We
validate the model both qualitatively, by visualising the subspace projections
with a text-to-image model and by preventing the imitation of artists' styles,
and quantitatively, through class invariance metrics and improvements to
baseline zero-shot classification.
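To make the component-analysis step concrete, the sketch below shows one way such a part-of-speech subspace could be obtained: treat CLIP text embeddings of captions that vary only the target part of speech as the variability to capture, and embeddings that vary the remaining parts of speech as the variability to suppress, then solve a generalized eigenvalue problem and project onto (or away from) the resulting subspace. This is a minimal illustration under those assumptions, not the paper's exact closed-form model; the names `fit_pos_subspace` and `project` and the generalized-eigenvalue formulation are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh


def fit_pos_subspace(target_embeds, rest_embeds, k=8, eps=1e-4):
    """Fit a k-dimensional subspace capturing variability in `target_embeds`
    (embeddings that vary only the target part of speech) while suppressing
    variability in `rest_embeds` (embeddings that vary the other parts of
    speech). Hypothetical generalized-eigenvalue sketch, not the paper's
    exact closed-form objective."""
    def scatter(X):
        Xc = X - X.mean(axis=0, keepdims=True)
        return Xc.T @ Xc / len(X)

    d = target_embeds.shape[1]
    S_target = scatter(target_embeds)                # variability to capture
    S_rest = scatter(rest_embeds) + eps * np.eye(d)  # variability to suppress (regularised)
    # Generalized eigenproblem: directions with high target variance relative
    # to rest variance; eigh returns eigenvalues in ascending order.
    _, vecs = eigh(S_target, S_rest)
    W, _ = np.linalg.qr(vecs[:, -k:])                # orthonormal basis of the subspace
    return W                                         # shape (d, k)


def project(embeds, W, remove=False):
    """Project L2-normalised CLIP embeddings onto the subspace spanned by W
    (or onto its orthogonal complement when remove=True), then renormalise
    so the result stays on the unit sphere."""
    P = W @ W.T
    if remove:
        P = np.eye(P.shape[0]) - P
    Z = embeds @ P
    return Z / np.linalg.norm(Z, axis=-1, keepdims=True)
```

Under this sketch, removing an appearance-specific subspace (e.g. one fitted to an artist's painting style) from a prompt embedding before CLIP-based text-to-image synthesis would amount to calling `project(text_embed, W_style, remove=True)`.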
Related papers
- U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation [18.841473623776153]
State-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space.
A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples to learn user-specified visual attributes.
Experiments on various kinds of visual attributes with SOTA personalization methods show the ability of the proposed method to mimic target visual appearance in novel contexts.
arXiv Detail & Related papers (2024-03-29T15:20:34Z)
- Interpreting CLIP's Image Representation via Text-Based Decomposition [73.54377859089801]
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation.
We decompose the image representation as a sum across individual image patches, model layers, and attention heads.
We use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter.
arXiv Detail & Related papers (2023-10-09T17:59:04Z)
- Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations [0.8594140167290097]
Multimodal embeddings aim to enrich the semantic information in neural representations of language compared to text-only models.
Our paper compares word embeddings from three vision-and-language models and three text-only models, with static and contextual representations.
This is the first large-scale study of the effect of visual grounding on language representations, including 46 semantic parameters.
arXiv Detail & Related papers (2023-06-04T12:53:12Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
- Detection and Captioning with Unseen Object Classes [12.894104422808242]
Test images may contain visual objects with no corresponding visual or textual training examples.
We propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model.
Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset.
arXiv Detail & Related papers (2021-08-13T10:43:20Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill high-quality, semantically consistent representations that capture the intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)