Linear Spaces of Meanings: Compositional Structures in Vision-Language
Models
- URL: http://arxiv.org/abs/2302.14383v3
- Date: Thu, 11 Jan 2024 18:21:52 GMT
- Title: Linear Spaces of Meanings: Compositional Structures in Vision-Language
Models
- Authors: Matthew Trager, Pramuditha Perera, Luca Zancato, Alessandro Achille,
Parminder Bhatia, Stefano Soatto
- Abstract summary: We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs).
We first present a framework for understanding compositional structures from a geometric perspective.
We then explain what these structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice.
- Score: 110.00434385712786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate compositional structures in data embeddings from pre-trained
vision-language models (VLMs). Traditionally, compositionality has been
associated with algebraic operations on embeddings of words from a pre-existing
vocabulary. In contrast, we seek to approximate representations from an encoder
as combinations of a smaller set of vectors in the embedding space. These
vectors can be seen as "ideal words" for generating concepts directly within
the embedding space of the model. We first present a framework for
understanding compositional structures from a geometric perspective. We then
explain what these compositional structures entail probabilistically in the
case of VLM embeddings, providing intuitions for why they arise in practice.
Finally, we empirically explore these structures in CLIP's embeddings and we
evaluate their usefulness for solving different vision-language tasks such as
classification, debiasing, and retrieval. Our results show that simple linear
algebraic operations on embedding vectors can be used as compositional and
interpretable methods for regulating the behavior of VLMs.
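To make the abstract's claim about linear compositional structure concrete, below is a minimal sketch of one way such "ideal word" directions could be extracted and recombined by simple averaging over attribute-object captions. The caption set, the averaging-based factorization, and every variable name are illustrative assumptions (with random stand-ins for real CLIP embeddings), not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for precomputed, L2-normalized CLIP text embeddings of
# attribute-object captions such as "a photo of a red car".
attributes = ["red", "blue", "old"]
objects_ = ["car", "house", "bicycle"]
dim = 512
emb = {(a, o): rng.normal(size=dim) for a in attributes for o in objects_}
emb = {k: v / np.linalg.norm(v) for k, v in emb.items()}

# A simple two-factor linear decomposition, one illustrative way to obtain
# "ideal word" directions:  z(a, o) ~= mu + u_a + v_o
mu = np.mean(list(emb.values()), axis=0)
u = {a: np.mean([emb[(a, o)] for o in objects_], axis=0) - mu for a in attributes}
v = {o: np.mean([emb[(a, o)] for a in attributes], axis=0) - mu for o in objects_}

# Residual of the additive reconstruction for every attribute-object pair.
errs = [np.linalg.norm(emb[(a, o)] - (mu + u[a] + v[o]))
        for a in attributes for o in objects_]
print("mean reconstruction error:", float(np.mean(errs)))

# Composition by linear algebra: recombine factor directions into a new vector,
# e.g. for a pair whose caption was never encoded as a whole.
z_blue_bicycle = mu + u["blue"] + v["bicycle"]
print("norm of composed vector:", float(np.linalg.norm(z_blue_bicycle)))
```

With real CLIP text embeddings, vectors composed this way could be scored against image embeddings by cosine similarity for zero-shot classification, retrieval, or debiasing, in the spirit of the experiments the abstract describes.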
Related papers
- Optimal synthesis embeddings [1.565361244756411]
We introduce a word embedding composition method based on an intuitive notion of what a fair embedding representation for a given set of words should satisfy.
We show that our approach excels in solving probing tasks designed to capture simple linguistic features of sentences.
arXiv Detail & Related papers (2024-06-10T18:06:33Z)
- Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) [22.364723506539974]
We show that the semantic structure of CLIP's latent space can be leveraged to provide interpretability.
We propose a novel method, Sparse Linear Concept Embeddings, for transforming CLIP representations into sparse linear combinations of human-interpretable concepts (a rough sketch of this kind of decomposition appears after this list).
arXiv Detail & Related papers (2024-02-16T00:04:36Z)
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhanced Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
arXiv Detail & Related papers (2023-05-06T03:57:05Z)
- Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding [143.5927158318524]
Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We argue that the inherent structured semantics within videos and language is the crucial factor for achieving compositional generalization.
arXiv Detail & Related papers (2023-01-22T08:02:23Z)
- Subspace Representations for Soft Set Operations and Sentence Similarities [17.52824249186434]
We realize representations of word sets and corresponding set operations within pre-trained word embedding spaces.
By grounding our approach in linear subspaces, we enable efficient computation of various set operations (a minimal projection-based sketch follows this list).
We show that our subspace-based set operations consistently outperform vector-based ones in both sentence similarity and set retrieval tasks.
arXiv Detail & Related papers (2022-10-24T08:34:10Z)
- Unsupervised Distillation of Syntactic Information from Contextualized Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- Lattice Representation Learning [6.427169570069738]
We introduce theory and algorithms for learning discrete representations that take values on a lattice embedded in a Euclidean space.
Lattice representations possess an interesting combination of properties: a) they can be computed explicitly using lattice quantization, yet they can be learned efficiently using the ideas we introduce (see the lattice-quantization sketch after this list).
This article will focus on laying the groundwork for exploring and exploiting the first two properties, including a new mathematical result linking expressions used during training and inference time and experimental validation on two popular datasets.
arXiv Detail & Related papers (2020-06-24T16:05:11Z)
- Multidirectional Associative Optimization of Function-Specific Word Representations [86.87082468226387]
We present a neural framework for learning associations between interrelated groups of words.
Our model induces a joint function-specific word vector space, where vectors of, e.g., plausible subject-verb-object (SVO) compositions lie close together.
The model retains information about word group membership even in the joint space, and can thereby effectively be applied to a number of tasks reasoning over the SVO structure.
arXiv Detail & Related papers (2020-05-11T17:07:20Z)
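The "Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)" entry above describes decomposing CLIP representations into sparse combinations of interpretable concepts. The sketch below shows one plausible way to perform such a decomposition with a positively-constrained Lasso over a concept dictionary; the dictionary, the planted mixture, the alpha value, and the use of scikit-learn are all assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_concepts, dim = 200, 512

# Stand-in concept dictionary; in practice each row would be the CLIP text
# embedding of a single human-interpretable concept ("dog", "grass", ...).
concepts = rng.normal(size=(n_concepts, dim))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# Build a stand-in embedding as a sparse mix of a few concepts plus noise,
# so the decomposition has something meaningful to recover.
planted = [3, 47, 120]
z = concepts[planted].sum(axis=0) + 0.01 * rng.normal(size=dim)
z /= np.linalg.norm(z)

# Sparse non-negative decomposition  z ~= concepts.T @ w  with few active w_i,
# here solved by a positive Lasso (one of several possible sparse solvers).
lasso = Lasso(alpha=3e-4, positive=True, fit_intercept=False, max_iter=10000)
lasso.fit(concepts.T, z)            # design matrix shape: (dim, n_concepts)
w = lasso.coef_

active = np.flatnonzero(w)
print("active concepts:", active.tolist(), "| planted:", planted)
recon = concepts.T @ w
cos = float(recon @ z / (np.linalg.norm(recon) * np.linalg.norm(z) + 1e-12))
print("reconstruction cosine similarity:", round(cos, 3))
```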
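For the "Subspace Representations for Soft Set Operations and Sentence Similarities" entry, the following sketch represents a word set as the linear span of its embeddings (an orthonormal basis obtained via SVD) and scores soft membership as the norm of a query vector's projection onto that subspace. The toy vocabulary, the membership score, and the basis construction are illustrative assumptions rather than the paper's exact operators.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 300

# Stand-in word embeddings; in practice these would come from a pre-trained
# embedding table (e.g. word2vec or GloVe vectors).
vocab = {w: rng.normal(size=dim) for w in
         ["cat", "dog", "horse", "piano", "violin", "trumpet"]}
vocab = {w: v / np.linalg.norm(v) for w, v in vocab.items()}

def subspace_basis(words, vocab):
    """Orthonormal basis for the span of a word set's embeddings (via SVD)."""
    M = np.stack([vocab[w] for w in words])   # (|set|, dim)
    _, s, vt = np.linalg.svd(M, full_matrices=False)
    return vt[s > 1e-8]                       # keep rows with nonzero singular values

def soft_membership(word, basis, vocab):
    """Norm of the projection of a unit word vector onto the set's subspace (0..1)."""
    return float(np.linalg.norm(basis @ vocab[word]))

animals = subspace_basis(["cat", "dog", "horse"], vocab)
instruments = subspace_basis(["piano", "violin", "trumpet"], vocab)

print("dog vs animal subspace:     ", soft_membership("dog", animals, vocab))
print("dog vs instrument subspace: ", soft_membership("dog", instruments, vocab))
```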
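Finally, for the "Lattice Representation Learning" entry, a bare-bones illustration of the "computed explicitly using lattice quantization" part: continuous codes are snapped to the nearest point of a scaled integer lattice. The choice of lattice (scale * Z^d) and the scale value are assumptions; the paper's actual lattices and training procedure are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

def lattice_quantize(x, scale=0.5):
    """Map each vector to the nearest point of the scaled integer lattice
    scale * Z^d (a simple example of explicit lattice quantization)."""
    return scale * np.round(x / scale)

# Continuous codes, e.g. the output of an encoder network (random stand-ins here).
codes = rng.normal(size=(4, 8))
discrete = lattice_quantize(codes)

print("quantization error per vector:",
      np.linalg.norm(codes - discrete, axis=1))
# During training one would typically pass gradients "straight through" the
# rounding step (treating it as identity) so the encoder can still be learned
# with standard backpropagation.
```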