Compositional Mixture Representations for Vision and Text
- URL: http://arxiv.org/abs/2206.06404v1
- Date: Mon, 13 Jun 2022 18:16:40 GMT
- Title: Compositional Mixture Representations for Vision and Text
- Authors: Stephan Alaniz, Marco Federici, Zeynep Akata
- Abstract summary: A common representation space between vision and language allows deep networks to relate objects in the image to the corresponding semantic meaning.
We present a model that learns a shared Gaussian mixture representation, imposing the compositionality of the text onto the visual domain without explicit location supervision.
- Score: 43.2292923754127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning a common representation space between vision and language allows
deep networks to relate objects in the image to the corresponding semantic
meaning. We present a model that learns a shared Gaussian mixture
representation, imposing the compositionality of the text onto the visual
domain without explicit location supervision. By combining a spatial
transformer with a representation learning approach, we learn to split images
into separately encoded patches and to associate visual and textual
representations in an interpretable manner. On variations of MNIST and
CIFAR10, our model is able to perform weakly supervised object detection and
to extrapolate to unseen combinations of objects.
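To make the described pipeline concrete, the following is a minimal sketch (not the authors' released code) of how a spatial transformer can crop an image into separately encoded patches whose embeddings act as the means of a shared Gaussian mixture, with caption words softly assigned to components. It assumes PyTorch and single-channel MNIST-style inputs; all module names, dimensions, and the word-to-component matching loss are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompositionalMixtureEncoder(nn.Module):
    """Splits an image into K patches via a spatial transformer and encodes
    each patch into the mean of one Gaussian mixture component (sketch only)."""

    def __init__(self, num_components=3, embed_dim=64, patch_size=14):
        super().__init__()
        self.K, self.patch_size = num_components, patch_size
        # Localization network: predicts (scale, tx, ty) for each of the K crops.
        self.localizer = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 3 * num_components),
        )
        # Shared patch encoder: one Gaussian mean per extracted patch.
        self.patch_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(patch_size * patch_size, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, images):                        # images: (B, 1, H, W)
        B = images.size(0)
        params = self.localizer(images).view(B, self.K, 3)
        scale = torch.sigmoid(params[..., 0:1])       # crop size in (0, 1)
        trans = torch.tanh(params[..., 1:3])          # crop center in (-1, 1)
        means = []
        for k in range(self.K):
            # Axis-aligned affine crop, differentiable via the spatial transformer.
            affine = torch.zeros(B, 2, 3, device=images.device)
            affine[:, 0, 0] = scale[:, k, 0]
            affine[:, 1, 1] = scale[:, k, 0]
            affine[:, :, 2] = trans[:, k]
            grid = F.affine_grid(affine, (B, 1, self.patch_size, self.patch_size),
                                 align_corners=False)
            patch = F.grid_sample(images, grid, align_corners=False)
            means.append(self.patch_encoder(patch))
        return torch.stack(means, dim=1)              # (B, K, embed_dim) mixture means


def word_to_component_loss(means, word_embs, temperature=1.0):
    """Softly assigns each caption word embedding to its closest mixture component,
    letting text compositionality shape the visual mixture without location labels."""
    dist = torch.cdist(word_embs, means)              # (B, num_words, K)
    assign = F.softmax(-dist / temperature, dim=-1)   # soft word-to-component assignment
    return (assign * dist).sum(dim=-1).mean()
```

In a setup like this, `CompositionalMixtureEncoder()(images)` on 28x28 inputs yields one embedding per component, and the predicted scale/translation parameters can be read out as bounding boxes, which is one plausible way the weakly supervised detection evaluation mentioned above could be realized; the actual model may differ in encoder design and matching objective.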
Related papers
- Towards Interpreting Visual Information Processing in Vision-Language Models [24.51408101801313]
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images.
We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM.
arXiv Detail & Related papers (2024-10-09T17:55:02Z)
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization [66.35116147275568]
Self-supervised representation learning has drawn considerable attention from the scene text recognition community.
We tackle the issue by formulating the representation learning scheme in a generative manner.
We propose a Similarity-Aware Normalization (SimAN) module to identify the different patterns and align the corresponding styles from the guiding patch.
arXiv Detail & Related papers (2022-03-20T08:43:10Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate information from local neighbors.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
- Image Captioning with Visual Object Representations Grounded in the Textual Modality [14.797241131469486]
We explore the possibilities of a shared embedding space between textual and visual modality.
We propose an approach opposite to the current trend: grounding the representations in the word embedding space of the captioning system.
arXiv Detail & Related papers (2020-10-19T12:21:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.