Describing Sets of Images with Textual-PCA
- URL: http://arxiv.org/abs/2210.12112v1
- Date: Fri, 21 Oct 2022 17:10:49 GMT
- Title: Describing Sets of Images with Textual-PCA
- Authors: Oded Hupert, Idan Schwartz, Lior Wolf
- Abstract summary: We seek to semantically describe a set of images, capturing both the attributes of single images and the variations within the set.
Our procedure is analogous to Principal Component Analysis, in which the role of projection vectors is replaced with generated phrases.
- Score: 89.46499914148993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We seek to semantically describe a set of images, capturing both the
attributes of single images and the variations within the set. Our procedure is
analogous to Principal Component Analysis, in which the role of projection
vectors is replaced with generated phrases. First, a centroid phrase that has
the largest average semantic similarity to the images in the set is generated,
where both the computation of the similarity and the generation are based on
pretrained vision-language models. Then, using the same models, the phrase that
yields the highest variation among the similarity scores is generated. The
next phrase maximizes the variance subject to being orthogonal, in the latent
space, to the highest-variance phrase, and the process continues. Our
experiments show that our method is able to convincingly capture the essence of
image sets and describe the individual elements in a semantically meaningful
way within the context of the entire set. Our code is available at:
https://github.com/OdedH/textual-pca.
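The procedure described in the abstract can be sketched as a greedy selection loop over candidate phrases. Below is a minimal Python sketch, assuming image and candidate-phrase embeddings have already been computed with a CLIP-like vision-language model and L2-normalized; the function name `textual_pca` and the pool-based greedy selection are illustrative stand-ins for the paper's generation procedure, not its actual implementation (see the repository above for that).

```python
import numpy as np

def textual_pca(image_embs, phrase_embs, phrases, k=3):
    """image_embs: (n_images, d); phrase_embs: (n_phrases, d); rows unit-norm."""
    sims = phrase_embs @ image_embs.T              # (n_phrases, n_images) similarity scores

    # 1) Centroid phrase: largest average similarity to the image set.
    centroid_idx = int(np.argmax(sims.mean(axis=1)))

    # 2) Principal phrases: greedily pick the phrase whose similarity scores have
    #    the highest variance, then project its direction out of the remaining
    #    candidates so the next pick is orthogonal to it in the latent space.
    chosen, basis = [], []
    for _ in range(k):
        embs = phrase_embs.copy()
        for b in basis:                            # remove components along already-chosen phrases
            embs -= np.outer(embs @ b, b)
        embs /= np.clip(np.linalg.norm(embs, axis=1, keepdims=True), 1e-8, None)
        var = (embs @ image_embs.T).var(axis=1)
        var[[centroid_idx] + chosen] = -np.inf     # never re-select a previous pick
        idx = int(np.argmax(var))
        chosen.append(idx)
        basis.append(embs[idx])
    return phrases[centroid_idx], [phrases[i] for i in chosen]

# Toy usage with random stand-in embeddings (real use would encode images and
# candidate phrases with the same vision-language model).
rng = np.random.default_rng(0)
imgs = rng.normal(size=(8, 64));   imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
cands = rng.normal(size=(30, 64)); cands /= np.linalg.norm(cands, axis=1, keepdims=True)
names = [f"phrase_{i}" for i in range(30)]
print(textual_pca(imgs, cands, names, k=2))
```

The projection-removal step mirrors the orthogonality constraint in the abstract: each new phrase is scored only on the part of its embedding that is orthogonal to the phrases already selected, which is the textual analogue of deflation in classical PCA.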
Related papers
- EAVL: Explicitly Align Vision and Language for Referring Image Segmentation [27.351940191216343]
We introduce a Vision-Language Aligner that aligns features in the segmentation stage using dynamic convolution kernels based on the input image and sentence.
Our method harnesses the potential of the multi-modal features in the segmentation stage and aligns language features of different emphases with image features to achieve fine-grained text-to-pixel correlation.
arXiv Detail & Related papers (2023-08-18T18:59:27Z)
- Visual Information Guided Zero-Shot Paraphrase Generation [71.33405403748237]
We propose visual information guided zero-shot paraphrase generation (ViPG) based only on paired image-caption data.
It jointly trains an image captioning model and a paraphrasing model and leverages the image captioning model to guide the training of the paraphrasing model.
Both automatic and human evaluation show our model can generate paraphrases with good relevance, fluency and diversity.
arXiv Detail & Related papers (2022-01-22T18:10:39Z)
- Semantic Distribution-aware Contrastive Adaptation for Semantic Segmentation [50.621269117524925]
Domain adaptive semantic segmentation refers to making predictions on a certain target domain with only annotations of a specific source domain.
We present a semantic distribution-aware contrastive adaptation algorithm that enables pixel-wise representation alignment.
We evaluate SDCA on multiple benchmarks, achieving considerable improvements over existing algorithms.
arXiv Detail & Related papers (2021-05-11T13:21:25Z)
- Deriving Visual Semantics from Spatial Context: An Adaptation of LSA and Word2Vec to generate Object and Scene Embeddings from Images [0.0]
We develop two approaches for learning object and scene embeddings from annotated images.
In the first approach, we generate embeddings from object co-occurrences in whole images, one for objects and one for scenes.
In the second approach, rather than analyzing whole images of scenes, we focus on co-occurrences of objects within subregions of an image.
arXiv Detail & Related papers (2020-09-20T08:26:38Z)
- Cross-domain Correspondence Learning for Exemplar-based Image Translation [59.35767271091425]
We present a framework for exemplar-based image translation, which synthesizes a photo-realistic image from the input in a distinct domain.
The output's style (e.g., color, texture) is consistent with the semantically corresponding objects in the exemplar.
We show that our method significantly outperforms state-of-the-art methods in terms of image quality.
arXiv Detail & Related papers (2020-04-12T09:10:57Z)
- Structural-analogy from a Single Image Pair [118.61885732829117]
In this paper, we explore the capabilities of neural networks to understand image structure given only a single pair of images, A and B.
We generate an image that keeps the appearance and style of B, but has a structural arrangement that corresponds to A.
Our method can be used to generate high quality imagery in other conditional generation tasks utilizing images A and B only.
arXiv Detail & Related papers (2020-04-05T14:51:10Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) that processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)