Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS)
- URL: http://arxiv.org/abs/2508.20322v1
- Date: Wed, 27 Aug 2025 23:39:42 GMT
- Title: Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS)
- Authors: Zhi Li, Hau Phan, Matthew Emigh, Austin J. Brockmeier,
- Abstract summary: Vision-language co-embedding networks, such as CLIP, provide a latent embedding space with semantic information.<n>We propose a supervised dictionary learning approach to estimate a linear synthesis model consisting of sparse, non-negative combinations of groups of vectors.<n>We show that the disentangled embeddings provided by our sparse linear concept subspaces (SLiCS) enable concept-filtered image retrieval.
- Score: 2.7255100506777894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language co-embedding networks, such as CLIP, provide a latent embedding space with semantic information that is useful for downstream tasks. We hypothesize that the embedding space can be disentangled to separate the information on the content of complex scenes by decomposing the embedding into multiple concept-specific component vectors that lie in different subspaces. We propose a supervised dictionary learning approach to estimate a linear synthesis model consisting of sparse, non-negative combinations of groups of vectors in the dictionary (atoms), whose group-wise activity matches the multi-label information. Each concept-specific component is a non-negative combination of atoms associated to a label. The group-structured dictionary is optimized through a novel alternating optimization with guaranteed convergence. Exploiting the text co-embeddings, we detail how semantically meaningful descriptions can be found based on text embeddings of words best approximated by a concept's group of atoms, and unsupervised dictionary learning can exploit zero-shot classification of training set images using the text embeddings of concept labels to provide instance-wise multi-labels. We show that the disentangled embeddings provided by our sparse linear concept subspaces (SLiCS) enable concept-filtered image retrieval (and conditional generation using image-to-prompt) that is more precise. We also apply SLiCS to highly-compressed autoencoder embeddings from TiTok and the latent embedding from self-supervised DINOv2. Quantitative and qualitative results highlight the improved precision of the concept-filtered image retrieval for all embeddings.
Related papers
- LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery [14.222802170483739]
LUCID is a vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations.<n> LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem.<n>Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts.
arXiv Detail & Related papers (2026-02-07T02:01:25Z) - Decomposing multimodal embedding spaces with group-sparse autoencoders [4.817429789586128]
We propose a new SAE-based approach to multimodal embedding decomposition using cross-modal random masking and group-sparse regularization.<n>We show that, compared to standard SAEs, our approach learns a more multimodal dictionary while reducing the number of dead neurons and improving feature semanticity.
arXiv Detail & Related papers (2026-01-27T20:04:07Z) - Cross-Layer Discrete Concept Discovery for Interpreting Language Models [13.842670153893977]
Cross-layer VQ-VAE is a framework that uses vector quantization to map representations across layers.<n>Our approach uniquely combines top-k temperature-based sampling during quantization with EMA codebook updates.
arXiv Detail & Related papers (2025-06-24T22:43:36Z) - Descriminative-Generative Custom Tokens for Vision-Language Models [101.40245125955306]
This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs)<n>Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries.
arXiv Detail & Related papers (2025-02-17T18:13:42Z) - Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation [8.659766913542938]
We study a united perceptual and semantic token compression for all granular understanding.<n>We propose Feature Pyramid Tokenization (PAT) to cluster and represent multi-resolution feature by learnable codebooks.<n>Our experiments show that PAT enhances the semantic intuition of VLM feature pyramid.
arXiv Detail & Related papers (2024-12-18T18:43:21Z) - Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.<n>We finetune CLIP so that text descriptions of differences between images correspond to their difference in image embedding space.<n>Our approach yields significantly improved capabilities in ranking images by a certain attribute, and improved zeroshot classification performance on many downstream image classification tasks.
arXiv Detail & Related papers (2024-09-15T13:02:14Z) - Learning Multi-Aspect Item Palette: A Semantic Tokenization Framework for Generative Recommendation [55.99632509895994]
We introduce LAMIA, a novel approach for multi-aspect semantic tokenization.<n>Unlike RQ-VAE, which uses a single embedding, LAMIA learns an item palette''--a collection of independent and semantically parallel embeddings.<n>Our results demonstrate significant improvements in recommendation accuracy over existing methods.
arXiv Detail & Related papers (2024-09-11T13:49:48Z) - Rewrite Caption Semantics: Bridging Semantic Gaps for
Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and greatly boosts language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z) - Linear Spaces of Meanings: Compositional Structures in Vision-Language
Models [110.00434385712786]
We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs)
We first present a framework for understanding compositional structures from a geometric perspective.
We then explain what these structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice.
arXiv Detail & Related papers (2023-02-28T08:11:56Z) - DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for
Open-world Detection [118.36746273425354]
This paper presents a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary.
By enriching the concepts with their descriptions, we explicitly build the relationships among various concepts to facilitate the open-domain learning.
The proposed framework demonstrates strong zero-shot detection performances, e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
arXiv Detail & Related papers (2022-09-20T02:01:01Z) - Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z) - On the Learnability of Concepts: With Applications to Comparing Word
Embedding Algorithms [0.0]
We introduce the notion of "concept" as a list of words that have shared semantic content.
We first use this notion to measure the learnability of concepts on pretrained word embeddings.
We then develop a statistical analysis of concept learnability, based on hypothesis testing and ROC curves, in order to compare the relative merits of various embedding algorithms.
arXiv Detail & Related papers (2020-06-17T14:25:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.