Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
- URL: http://arxiv.org/abs/2410.06981v4
- Date: Wed, 21 May 2025 00:01:46 GMT
- Title: Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
- Authors: Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez
- Abstract summary: The Universality Hypothesis claims that different models converge towards similar concept representations in their latent spaces. We introduce a new variation of the universality hypothesis called Analogous Feature Universality.
- Score: 14.594698598522797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Universality Hypothesis in large language models (LLMs) claims that different models converge towards similar concept representations in their latent spaces. Providing evidence for this hypothesis would enable researchers to exploit universal properties, facilitating the generalization of mechanistic interpretability techniques across models. Previous works studied whether LLMs learn the same features, which are internal representations that activate on specific concepts. Since comparing features across LLMs is challenging due to polysemanticity, in which LLM neurons often correspond to multiple unrelated features rather than to distinct concepts, sparse autoencoders (SAEs) have been employed to disentangle LLM neurons into SAE features corresponding to distinct concepts. In this paper, we introduce a new variation of the universality hypothesis called Analogous Feature Universality: we hypothesize that even if SAEs across different models learn different feature representations, the spaces spanned by SAE features are similar, such that one SAE space is similar to another SAE space under rotation-invariant transformations. Evidence for this hypothesis would imply that interpretability techniques related to latent spaces, such as steering vectors, may be transferred across models via certain transformations. To investigate this hypothesis, we first pair SAE features across different models via activation correlation, and then measure spatial relation similarities between paired features via representational similarity measures, which transform spaces into representations that reveal hidden relational similarities. Our experiments demonstrate high similarities for SAE feature spaces across various LLMs, providing evidence for feature space universality.
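As a concrete illustration of the pipeline the abstract describes, the sketch below pairs SAE features across two models by activation correlation and then scores the paired spaces with an SVCCA-style rotation-invariant similarity. It is a minimal reconstruction from the abstract, not the authors' code; the array shapes, the greedy argmax pairing, and the `keep=20` truncation are illustrative assumptions.

```python
import numpy as np

def pair_features(acts_a, acts_b):
    """Pair SAE features across two models by activation correlation.
    acts_a: (n_tokens, d_a) and acts_b: (n_tokens, d_b) are SAE feature
    activations collected on the same token stream. Returns, for each
    feature of A, the index of its most-correlated feature of B
    (greedy, many-to-one pairing for simplicity)."""
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = a.T @ b / len(a)              # (d_a, d_b) Pearson correlations
    return corr.argmax(axis=1)

def svcca(x, y, keep=20):
    """Rotation-invariant similarity of two activation spaces:
    SVD-truncate each space, then average the canonical correlations."""
    def top_svd(m):
        u, s, _ = np.linalg.svd(m - m.mean(0), full_matrices=False)
        return u[:, :keep] * s[:keep]
    qx, _ = np.linalg.qr(top_svd(x))
    qy, _ = np.linalg.qr(top_svd(y))
    return np.linalg.svd(qx.T @ qy, compute_uv=False).mean()

# Toy usage with random stand-ins for two models' SAE activations.
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((1000, 64))
acts_b = acts_a @ rng.standard_normal((64, 96))   # B spans a linear image of A
pairing = pair_features(acts_a, acts_b)
print(svcca(acts_a, acts_b[:, pairing]))          # close to 1.0 here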
Related papers
- Semantic Convergence: Investigating Shared Representations Across Scaled LLMs [4.172347145536457]
Large language models carve the world into broadly similar, interpretable features despite size differences, reinforcing universality as a foundation for cross-model interpretability. Preliminary experiments extend the analysis from single tokens to multi-token subspaces, showing that semantically similar subspaces interact similarly with language models.
arXiv Detail & Related papers (2025-07-21T07:09:32Z)
- Transferring Features Across Language Models With Model Stitching [61.24716360332365]
We show that affine mappings between the residual streams of language models are a cheap way to transfer represented features between models. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring them to a larger model at a FLOPs savings.
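A minimal sketch of the stitching idea as summarized above: collect residual-stream activations from both models on the same tokens and fit an affine map by least squares. The shapes and the plain `lstsq` fit are assumptions for illustration; the paper's actual training recipe may differ.

```python
import numpy as np

def fit_affine_stitch(h_small, h_large):
    """Fit h_large ≈ h_small @ W + b by least squares, where both are
    residual-stream activations collected on the same token stream."""
    x = np.hstack([h_small, np.ones((len(h_small), 1))])  # bias column
    sol, *_ = np.linalg.lstsq(x, h_large, rcond=None)
    return sol[:-1], sol[-1]                              # W, b

# Toy usage: stitch a 512-d stream into a 1024-d stream.
rng = np.random.default_rng(0)
h_small = rng.standard_normal((2000, 512))
h_large = h_small @ rng.standard_normal((512, 1024)) + 0.1
W, b = fit_affine_stitch(h_small, h_large)
print(np.allclose(h_small @ W + b, h_large, atol=1e-5))   # True
```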
arXiv Detail & Related papers (2025-06-07T01:03:25Z)
- Linear Representation Transferability Hypothesis: Leveraging Small Models to Steer Large Models [6.390475802910619]
We show that representations learned across models trained on the same data can be expressed as linear combinations of a universal set of basis features. These basis features underlie the learning task itself and remain consistent across models, regardless of scale.
arXiv Detail & Related papers (2025-05-31T17:45:18Z)
- The Origins of Representation Manifolds in Large Language Models [52.68554895844062]
We show that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.
arXiv Detail & Related papers (2025-05-23T13:31:22Z)
- Enhancing Interpretability of Sparse Latent Representations with Class Information [1.2835555561822447]
Variational Autoencoders (VAEs) are powerful generative models for learning latent representations. We propose a novel approach to enhance latent space interpretability by ensuring that the active dimensions in the latent space are consistent across samples within the same class.
arXiv Detail & Related papers (2025-05-20T15:10:01Z)
- Semantic Wave Functions: Exploring Meaning in Large Language Models Through Quantum Formalism [0.0]
Large Language Models (LLMs) encode semantic relationships in high-dimensional vector embeddings. This paper explores the analogy between LLM embedding spaces and quantum mechanics. We introduce a "semantic wave function" to formalize this quantum-derived representation.
arXiv Detail & Related papers (2025-03-09T08:23:31Z)
- LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models [16.37602070339033]
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs.
We propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency.
Our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder.
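The sketch below illustrates the general mechanism this summary describes: encode hidden states into an overcomplete, ReLU-sparse SAE feature space, shift selected features, and decode back to the residual stream. The `SparseAutoencoder` class, dimensions, and `delta` offset are hypothetical stand-ins, not the LF-Steering implementation.

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    """Minimal SAE: ReLU-gated, overcomplete encoding of hidden states."""
    def __init__(self, d_model, d_feat):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_feat)
        self.dec = torch.nn.Linear(d_feat, d_model)

    def encode(self, h):
        return torch.relu(self.enc(h))   # sparse, high-dimensional features

    def decode(self, f):
        return self.dec(f)

def steer(sae, h, feature_ids, delta=2.0):
    """Shift chosen SAE features, then map back to the residual stream."""
    f = sae.encode(h)
    f[..., feature_ids] += delta         # hypothetical steering offset
    return sae.decode(f)

sae = SparseAutoencoder(d_model=768, d_feat=16384)
h = torch.randn(4, 768)                  # stand-in hidden states
with torch.no_grad():
    print(steer(sae, h, feature_ids=[12, 99]).shape)  # torch.Size([4, 768])
```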
arXiv Detail & Related papers (2025-01-19T13:06:51Z)
- Optimizing Speech Multi-View Feature Fusion through Conditional Computation [51.23624575321469]
Self-supervised learning (SSL) features provide lightweight and versatile multi-view speech representations.
SSL features conflict with traditional spectral features like FBanks in terms of update directions.
We propose a novel generalized feature fusion framework grounded in conditional computation.
arXiv Detail & Related papers (2025-01-14T12:12:06Z)
- On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages [56.22289522687125]
Selective state-space models (SSMs) are an emerging alternative to the Transformer.
We analyze their expressiveness and length generalization performance on regular language tasks.
We introduce the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization.
arXiv Detail & Related papers (2024-12-26T20:53:04Z)
- Verbalized Representation Learning for Interpretable Few-Shot Generalization [130.8173035901391]
Verbalized Representation Learning (VRL) is a novel approach for automatically extracting human-interpretable features for object recognition.
Our method captures inter-class differences and intra-class commonalities in the form of natural language.
VRL achieves a 24% absolute improvement over prior state-of-the-art methods.
arXiv Detail & Related papers (2024-11-27T01:55:08Z)
- Understanding the Role of Equivariance in Self-supervised Learning [51.56331245499712]
Equivariant self-supervised learning (E-SSL) learns features to be augmentation-aware.
We identify a critical explaining-away effect in E-SSL that creates a synergy between the equivariant and classification tasks.
We reveal several principles for practical designs of E-SSL.
arXiv Detail & Related papers (2024-11-10T16:09:47Z)
- Universal New Physics Latent Space [0.0]
We develop a machine learning method for mapping data originating from both Standard Model processes and various theories beyond the Standard Model into a unified representation (latent) space. We apply our method to three examples of new physics at the LHC of increasing complexity, showing that models can be clustered according to their LHC phenomenology.
arXiv Detail & Related papers (2024-07-29T18:00:00Z)
- MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model [11.91010815015959]
We identify domain-specific neurons in multimodal large language models.
We propose a three-stage mechanism for language model modules in MLLMs when handling projected image features.
arXiv Detail & Related papers (2024-06-17T03:59:44Z)
- Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration [39.35476224845088]
Large language models (LLMs) exhibit complementary strengths in various tasks, motivating the research of LLM ensembling.
We propose DeePEn, a training-free ensemble framework that fuses the informative probability distributions yielded by different LLMs at each decoding step.
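A toy sketch of per-step distribution fusion: DeePEn itself first maps heterogeneous vocabularies into a shared relative space, whereas this simplification assumes the models share a vocabulary; the uniform weights are an illustrative default.

```python
import numpy as np

def fuse_step(dists, weights=None):
    """Weighted average of several models' next-token distributions at
    one decoding step, renormalized to sum to one."""
    dists = np.asarray(dists, dtype=float)
    w = np.full(len(dists), 1 / len(dists)) if weights is None else np.asarray(weights)
    fused = (w[:, None] * dists).sum(axis=0)
    return fused / fused.sum()

p1 = np.array([0.6, 0.3, 0.1])    # model 1's next-token distribution
p2 = np.array([0.2, 0.7, 0.1])    # model 2's next-token distribution
print(fuse_step([p1, p2]))        # [0.4 0.5 0.1]
```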
arXiv Detail & Related papers (2024-04-19T08:52:22Z)
- Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora.
We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs.
Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
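A rough sketch of the LAPE idea as stated in the summary: estimate each neuron's activation probability per language, normalize, and flag low-entropy neurons as language-specific. The normalization and selection details here are assumptions; the paper defines the exact procedure.

```python
import numpy as np

def lape(act_prob):
    """act_prob[i, l] = estimated probability that neuron i activates on
    language l. Normalize per neuron, then take the entropy: a low LAPE
    score flags a language-specific neuron."""
    p = act_prob / (act_prob.sum(axis=1, keepdims=True) + 1e-12)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

# Toy usage: 4 neurons observed across 3 languages.
act_prob = np.array([
    [0.90, 0.02, 0.03],   # fires almost only on language 0
    [0.30, 0.31, 0.29],   # language-agnostic
    [0.01, 0.80, 0.02],   # fires almost only on language 1
    [0.25, 0.20, 0.22],   # language-agnostic
])
scores = lape(act_prob)
print(scores.argsort()[:2])   # lowest entropy = most language-specific
```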
arXiv Detail & Related papers (2024-02-26T09:36:05Z)
- Dive into the Chasm: Probing the Gap between In- and Cross-Topic Generalization [66.4659448305396]
This study analyzes various LMs with three probing-based experiments to shed light on the reasons behind the In- vs. Cross-Topic generalization gap.
We demonstrate, for the first time, that generalization gaps and the robustness of the embedding space vary significantly across LMs.
arXiv Detail & Related papers (2024-02-02T12:59:27Z)
- Towards Measuring Representational Similarity of Large Language Models [1.7228514699394508]
We measure the similarity of representations of a set of large language models with 7B parameters.
Our results suggest that some LLMs are substantially different from others.
We identify challenges of using representational similarity measures that suggest the need for careful study of similarity scores to avoid false conclusions.
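For reference, one widely used rotation-invariant measure in such comparisons is linear centered kernel alignment (CKA); the sketch below is illustrative and not necessarily the measure this paper adopts.

```python
import numpy as np

def linear_cka(x, y):
    """Linear centered kernel alignment between two representation
    matrices of shape (n_samples, d). Invariant to orthogonal
    transformations and isotropic scaling."""
    x = x - x.mean(0)
    y = y - y.mean(0)
    hsic = np.linalg.norm(x.T @ y, "fro") ** 2
    return hsic / (np.linalg.norm(x.T @ x, "fro") *
                   np.linalg.norm(y.T @ y, "fro"))

rng = np.random.default_rng(1)
x = rng.standard_normal((500, 128))
q, _ = np.linalg.qr(rng.standard_normal((128, 128)))
print(linear_cka(x, x @ q))                            # ~1.0: rotated copy
print(linear_cka(x, rng.standard_normal((500, 128))))  # much lower
```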
arXiv Detail & Related papers (2023-12-05T12:48:04Z)
- GBE-MLZSL: A Group Bi-Enhancement Framework for Multi-Label Zero-Shot Learning [24.075034737719776]
This paper investigates a challenging problem of zero-shot learning in the multi-label scenario (MLZSL).
We propose a novel and effective group bi-enhancement framework for MLZSL, dubbed GBE-MLZSL, to fully make use of such properties and enable a more accurate and robust visual-semantic projection.
Experiments on the large-scale MLZSL benchmark datasets NUS-WIDE and Open-Images-v4 demonstrate that the proposed GBE-MLZSL outperforms other state-of-the-art methods by large margins.
arXiv Detail & Related papers (2023-09-02T12:07:21Z)
- Finding Neurons in a Haystack: Case Studies with Sparse Probing [2.278231643598956]
Internal computations of large language models (LLMs) remain opaque and poorly understood.
We train $k$-sparse linear classifiers to predict the presence of features in the input.
By varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale.
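A simple stand-in for the $k$-sparse probing setup described above: rank neurons by a univariate mean-difference score, fit a logistic probe restricted to the top $k$, and sweep $k$. The ranking heuristic and synthetic data are assumptions for illustration, not the paper's exact classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def k_sparse_probe(acts, labels, k):
    """Rank neurons by a univariate mean-difference score, then fit a
    logistic probe restricted to the top k of them."""
    diff = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)
    top_k = np.argsort(np.abs(diff) / (acts.std(0) + 1e-8))[-k:]
    clf = LogisticRegression(max_iter=1000).fit(acts[:, top_k], labels)
    return top_k, clf.score(acts[:, top_k], labels)

# Synthetic activations with a single class-informative neuron (index 7).
rng = np.random.default_rng(0)
acts = rng.standard_normal((2000, 512))
labels = rng.integers(0, 2, 2000)
acts[labels == 1, 7] += 2.0
for k in (1, 4, 16):                     # sweep the sparsity level k
    top_k, acc = k_sparse_probe(acts, labels, k)
    print(k, round(acc, 3), 7 in top_k)
```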
arXiv Detail & Related papers (2023-05-02T17:13:55Z)
- Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning.
This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)
- Beyond Instance Discrimination: Relation-aware Contrastive Self-supervised Learning [75.46664770669949]
We present relation-aware contrastive self-supervised learning (ReCo) to integrate instance relations.
Our ReCo consistently gains remarkable performance improvements.
arXiv Detail & Related papers (2022-11-02T03:25:28Z)
- Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
- The Geometry of Self-supervised Learning Models and its Impact on Transfer Learning [62.601681746034956]
Self-supervised learning (SSL) has emerged as a desirable paradigm in computer vision.
We propose a data-driven geometric strategy to analyze different SSL models using local neighborhoods in the feature space induced by each.
arXiv Detail & Related papers (2022-09-18T18:15:38Z)
- Measuring dissimilarity with diffeomorphism invariance [94.02751799024684]
We introduce DID, a pairwise dissimilarity measure applicable to a wide range of data spaces.
We prove that DID enjoys properties which make it relevant for theoretical study and practical use.
arXiv Detail & Related papers (2022-02-11T13:51:30Z)
- Sparse Interventions in Language Models with Differentiable Masking [37.220380160016624]
We propose a method that discovers within a neural LM a small subset of neurons responsible for a linguistic phenomenon.
Our experiments confirm that each of these phenomena is mediated through a small subset of neurons.
arXiv Detail & Related papers (2021-12-13T17:49:16Z)
- Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z)
- Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models incorporated with appropriate training settings may effectively yield universal representation.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
- Learning Mixtures of Random Utility Models with Features from Incomplete Preferences [34.50516583809234]
We consider RUMs with features and their mixtures, where each alternative has a vector of features, possibly different across agents.
We extend mixtures of RUMs with features to models that generate incomplete preferences and characterize their identifiability.
Our experiments on synthetic data demonstrate the effectiveness of MLE on PL with features, with tradeoffs between statistical efficiency and computational efficiency.
arXiv Detail & Related papers (2020-06-06T13:47:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.