Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
- URL: http://arxiv.org/abs/2502.12892v1
- Date: Tue, 18 Feb 2025 14:29:11 GMT
- Title: Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
- Authors: Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, Talia Konkle
- Abstract summary: Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability.
Existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries.
We present Archetypal SAEs, wherein dictionary atoms are constrained to the convex hull of data.
- Score: 16.894375498353092
- Abstract: Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and their mildly relaxed variants RA-SAEs further match state-of-the-art reconstruction abilities. To rigorously assess dictionary quality learned by SAEs, we introduce two new benchmarks that test (i) plausibility, if dictionaries recover "true" classification directions and (ii) identifiability, if dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
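The convex-hull constraint described in the abstract can be illustrated with a minimal NumPy sketch. It assumes each dictionary atom is written as a convex combination of data points, with the mixing weights obtained from a row-wise softmax. The function names, shapes, and the softmax parameterization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def archetypal_dictionary(data, logits):
    """Build atoms that lie in the convex hull of `data`.

    data:   (n_points, dim) array of data points (hypothetical activations).
    logits: (n_atoms, n_points) unconstrained parameters (an assumption here).
    Returns the (n_atoms, dim) dictionary and the (n_atoms, n_points) weights.
    """
    # Each row of weights is nonnegative and sums to 1, so each atom is a
    # convex combination of the data points.
    weights = softmax(logits, axis=1)
    return weights @ data, weights


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 8))    # toy stand-in for model activations
logits = rng.normal(size=(4, 50))  # one row of logits per dictionary atom
atoms, weights = archetypal_dictionary(data, logits)
```

Because every row of the weight matrix sums to one and has no negative entries, each atom stays inside the convex hull of the data regardless of how the logits are updated during training.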
Related papers
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control [43.860799289234755]
We propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against supervised feature dictionaries.
First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task.
We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets.
arXiv Detail & Related papers (2024-05-14T07:07:13Z) - Towards a Fully Interpretable and More Scalable RSA Model for Metaphor Understanding [0.8437187555622164]
The Rational Speech Act (RSA) model provides a flexible framework to model pragmatic reasoning in computational terms.
Here, we introduce a new RSA framework for metaphor understanding that addresses limitations by providing an explicit formula.
The model was tested against 24 metaphors, not limited to the conventional "John-is-a-shark" type.
arXiv Detail & Related papers (2024-04-03T18:09:33Z) - On the Tip of the Tongue: Analyzing Conceptual Representation in Large Language Models with Reverse-Dictionary Probe [36.65834065044746]
We use in-context learning to guide the models to generate the term for an object concept implied in a linguistic description.
Experiments suggest that conceptual inference ability, as probed by the reverse-dictionary task, predicts a model's general reasoning performance.
arXiv Detail & Related papers (2024-02-22T09:45:26Z) - How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z) - On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training [109.9218185711916]
Aspect-based sentiment analysis (ABSA) aims at automatically inferring the specific sentiment polarities toward certain aspects of products or services behind social media texts or reviews.
We propose to enhance ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training.
arXiv Detail & Related papers (2023-04-19T11:07:43Z) - Syntactically Robust Training on Partially-Observed Data for Open Information Extraction [25.59133746149343]
Open Information Extraction models have shown promising results with sufficient supervision.
We propose a syntactically robust training framework that enables models to be trained on a syntactic-abundant distribution.
arXiv Detail & Related papers (2023-01-17T12:39:13Z) - Equivariant Transduction through Invariant Alignment [71.45263447328374]
We introduce a novel group-equivariant architecture that incorporates a group-invariant hard alignment mechanism.
We find that our network's structure allows it to develop stronger equivariant properties than existing group-equivariant approaches.
We additionally find that it outperforms previous group-equivariant networks empirically on the SCAN task.
arXiv Detail & Related papers (2022-09-22T11:19:45Z) - The King is Naked: on the Notion of Robustness for Natural Language Processing [18.973116252065278]
We argue for semantic robustness, which is better aligned with the human concept of linguistic fidelity.
We study semantic robustness of a range of vanilla and robustly trained architectures using a template-based generative test bed.
arXiv Detail & Related papers (2021-12-13T16:19:48Z) - Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness [61.827054365139645]
Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference.
We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z) - A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z) - Introducing Syntactic Structures into Target Opinion Word Extraction with Deep Learning [89.64620296557177]
We propose to incorporate the syntactic structures of the sentences into the deep learning models for targeted opinion word extraction.
We also introduce a novel regularization technique to improve the performance of the deep learning models.
The proposed model is extensively analyzed and achieves state-of-the-art performance on four benchmark datasets.
arXiv Detail & Related papers (2020-10-26T07:13:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.