Related papers: Disentangling Dense Embeddings with Sparse Autoencoders

URL: http://arxiv.org/abs/2408.00657v2
Date: Mon, 5 Aug 2024 03:25:01 GMT
Title: Disentangling Dense Embeddings with Sparse Autoencoders
Authors: Charles O'Neill, Christine Ye, Kartheik Iyer, John F. Wu
Abstract summary: Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models. We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models, demonstrating their effectiveness in disentangling semantic concepts. By training SAEs on embeddings of over 420,000 scientific paper abstracts from computer science and astronomy, we show that the resulting sparse representations maintain semantic fidelity while offering interpretability. We analyse these learned features, exploring their behaviour across different model capacities and introducing a novel method for identifying "feature families" that represent related concepts at varying levels of abstraction. To demonstrate the practical utility of our approach, we show how these interpretable features can be used to precisely steer semantic search, allowing for fine-grained control over query semantics. This work bridges the gap between the semantic richness of dense embeddings and the interpretability of sparse representations. We open source our embeddings, trained sparse autoencoders, and interpreted features, as well as a web app for exploring them.
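The idea described in the abstract can be illustrated with a minimal sketch. It encodes a dense embedding into a sparse latent vector, decodes it back, and steers a search query by editing a single latent. The sketch assumes a top-k SAE architecture. It uses random, untrained weights, so the individual latents carry no real meaning here. It is not the authors' released code. The class and function names (`TopKSAE`, `steer`, `search`) and the steering rule (add the decoded change in one latent back onto the original query) are hypothetical choices made for illustration.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

class TopKSAE:
    """Toy top-k sparse autoencoder (hypothetical); weights are random, not trained."""

    def __init__(self, d, n_latents, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Encoder rows are latent directions (n_latents x d).
        self.W_enc = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
                      for _ in range(n_latents)]
        self.b_enc = [0.0] * n_latents
        # Decoder initialised as the encoder transpose (d x n_latents); an assumption.
        self.W_dec = [list(col) for col in zip(*self.W_enc)]
        self.b_dec = [0.0] * d

    def encode(self, x):
        """Keep only the k largest pre-activations, clipped at zero."""
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.W_enc, centered), self.b_enc)]
        top = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[: self.k]
        z = [0.0] * len(pre)
        for i in top:
            z[i] = max(pre[i], 0.0)
        return z

    def decode(self, z):
        """Linear reconstruction of the embedding from sparse latents."""
        return [r + b for r, b in zip(matvec(self.W_dec, z), self.b_dec)]

    def steer(self, x, feature, delta):
        """Shift one latent by delta and add the decoded change back onto x.

        Adding only the difference keeps the SAE's reconstruction error in the
        steered query, so content the SAE does not capture is left untouched.
        """
        z = self.encode(x)
        z_edit = list(z)
        z_edit[feature] += delta
        diff = [a - b for a, b in zip(self.decode(z_edit), self.decode(z))]
        return [xi + di for xi, di in zip(x, diff)]

def search(query, corpus, top_n=3):
    """Return indices of the top_n corpus embeddings by cosine similarity."""
    scores = [(cosine(query, doc), i) for i, doc in enumerate(corpus)]
    return [i for _, i in sorted(scores, reverse=True)[:top_n]]

# Usage on random stand-in "embeddings" (not real abstract embeddings).
rng = random.Random(1)
d = 16
sae = TopKSAE(d=d, n_latents=64, k=4)
corpus = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(20)]
query = corpus[0]
print("original ranking:", search(query, corpus))
steered = sae.steer(query, feature=7, delta=5.0)
print("steered ranking: ", search(steered, corpus))
```

Steering through the decoder difference, rather than replacing the query with its reconstruction, is one reasonable design choice. It changes the query only along the edited feature's decoder direction.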

Related papers

Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval [13.31210969917096]
We propose a novel interpretability framework for Dense Passage Retrieval (DPR) models. We generate natural language descriptions for each latent concept, enabling human interpretations of both the dense embeddings and the query-document similarity scores of DPR models. We show that Concept-Level Sparse Retrieval (CL-SR) achieves high index-space and computational efficiency while maintaining robust performance across vocabulary and semantic mismatches.
arXiv Detail & Related papers (2025-05-28T02:50:17Z)
The Complexity of Learning Sparse Superposed Features with Feedback [0.9838799448847586]
We investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios.
arXiv Detail & Related papers (2025-02-08T01:54:23Z)
Sparse Autoencoder Insights on Voice Embeddings [3.2377830280631468]
This study applies sparse autoencoders to speaker embeddings generated from a Titanet model. The extracted features exhibit characteristics similar to those found in Large Language Model embeddings, including feature splitting and steering. The analysis reveals that the autoencoder can identify and manipulate features such as language and music, which are not evident in the original embedding.
arXiv Detail & Related papers (2025-01-31T19:21:43Z)
Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts [68.48103545146127]
This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces. We directly leverage natural language prompts and image captions to map latent directions. Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models.
arXiv Detail & Related papers (2024-10-25T21:44:51Z)
Understanding Before Recommendation: Semantic Aspect-Aware Review Exploitation via Large Language Models [53.337728969143086]
Recommendation systems harness user-item interactions like clicks and reviews to learn their representations. Previous studies improve recommendation accuracy and interpretability by modeling user preferences across various aspects and intents. We introduce a chain-based prompting approach to uncover semantic aspect-aware interactions.
arXiv Detail & Related papers (2023-12-26T15:44:09Z)
Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z)
Semantic Prompt for Few-Shot Image Recognition [76.68959583129335]
We propose a novel Semantic Prompt (SP) approach for few-shot learning. The proposed approach achieves promising results, improving the 1-shot learning accuracy by 3.67% on average.
arXiv Detail & Related papers (2023-03-24T16:32:19Z)
Interpreting Embedding Spaces by Conceptualization [2.620130580437745]
We present a novel method of understanding embeddings by transforming a latent embedding space into a comprehensible conceptual space. We devise a new evaluation method, using either human raters or LLM-based raters, to show that the vectors indeed represent the semantics of the original latent ones.
arXiv Detail & Related papers (2022-08-22T15:32:17Z)
A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes. We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models. We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
Robust and Interpretable Grounding of Spatial References with Relation Networks [40.42540299023808]
Learning representations of spatial references in natural language is a key challenge in tasks like autonomous navigation and robotic manipulation. Recent work has investigated various neural architectures for learning multi-modal representations for spatial concepts. We develop effective models for understanding spatial references in text that are robust and interpretable.
arXiv Detail & Related papers (2020-05-02T04:11:33Z)
Distributional semantic modeling: a revised technique to train term/word vector space models applying the ontology-related approach [36.248702416150124]
We design a new technique for distributional semantic modeling with a neural network-based approach to learn distributed term representations (or term embeddings). Vec2graph is a Python library for visualizing word embeddings (term embeddings in our case) as dynamic and interactive graphs.
arXiv Detail & Related papers (2020-03-06T18:27:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.