LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery
- URL: http://arxiv.org/abs/2602.07311v1
- Date: Sat, 07 Feb 2026 02:01:25 GMT
- Title: LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery
- Authors: Difei Gu, Yunhe Gao, Gerasimos Chatzoudis, Zihan Dong, Guoning Zhang, Bangwei Guo, Yang Zhou, Mu Zhou, Dimitris Metaxas,
- Abstract summary: LUCID is a vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations.<n> LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem.<n>Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts.
- Score: 14.222802170483739
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sparse autoencoders (SAEs) offer a natural path toward comparable explanations across different representation spaces. However, current SAEs are trained per modality, producing dictionaries whose features are not directly understandable and whose explanations do not transfer across domains. In this study, we introduce LUCID (Learning Unified vision-language sparse Codes for Interpretable concept Discovery), a unified vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations, while reserving private capacity for modality-specific details. We achieve feature alignment by coupling the shared codes with a learned optimal transport matching objective without the need of labeling. LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem in similarity-based evaluation. Leveraging the alignment properties, we develop an automated dictionary interpretation pipeline based on term clustering without manual observations. Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts, demonstrating a comprehensive approach to interpretable multimodal representations.
Related papers
- VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set [80.50996301430108]
The alignment of vision-language representations endows current Vision-Language Models with strong multi-modal reasoning capabilities.<n>We propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations.<n>For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts.
arXiv Detail & Related papers (2025-10-24T10:29:31Z) - Disentangling Latent Embeddings with Sparse Linear Concept Subspaces (SLiCS) [2.7255100506777894]
Vision-language co-embedding networks, such as CLIP, provide a latent embedding space with semantic information.<n>We propose a supervised dictionary learning approach to estimate a linear synthesis model consisting of sparse, non-negative combinations of groups of vectors.<n>We show that the disentangled embeddings provided by our sparse linear concept subspaces (SLiCS) enable concept-filtered image retrieval.
arXiv Detail & Related papers (2025-08-27T23:39:42Z) - Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability [54.420663939897686]
We propose the Attribute-formed Language Bottleneck Model (ALBM) to achieve interpretable image recognition.<n>ALBM organizes concepts in the attribute-formed class-specific space, where concepts are descriptions of specific attributes for specific classes.<n>To further improve interpretability, we propose Visual Attribute Prompt Learning (VAPL) to extract visual features on fine-grained attributes.
arXiv Detail & Related papers (2025-03-26T07:59:04Z) - Improving vision-language alignment with graph spiking hybrid Networks [10.88584928028832]
This paper proposes a comprehensive visual semantic representation module, necessitating the utilization of panoptic segmentation to generate fine-grained semantic features.<n>We propose a novel Graph Spiking Hybrid Network (GSHN) that integrates the complementary advantages of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic information.
arXiv Detail & Related papers (2025-01-31T11:55:17Z) - Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks.
We present one of the first applications of SAEs to dense text embeddings from large language models.
We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z) - Rewrite Caption Semantics: Bridging Semantic Gaps for
Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and greatly boosts language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z) - Identifying Interpretable Subspaces in Image Representations [54.821222487956355]
We propose a framework to explain features of image representations using Contrasting Concepts (FALCON)
For a target feature, FALCON captions its highly activating cropped images using a large captioning dataset and a pre-trained vision-language model like CLIP.
Each word among the captions is scored and ranked leading to a small number of shared, human-understandable concepts.
arXiv Detail & Related papers (2023-07-20T00:02:24Z) - Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot
Learning [74.48337375174297]
Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain.
We deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between prototypes and visual features.
DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling the recast of the unmatched semantic-visual pair into the matched one.
arXiv Detail & Related papers (2023-03-27T15:21:43Z) - DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for
Open-world Detection [118.36746273425354]
This paper presents a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary.
By enriching the concepts with their descriptions, we explicitly build the relationships among various concepts to facilitate the open-domain learning.
The proposed framework demonstrates strong zero-shot detection performances, e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
arXiv Detail & Related papers (2022-09-20T02:01:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.