Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
- URL: http://arxiv.org/abs/2410.06981v1
- Date: Wed, 9 Oct 2024 15:18:57 GMT
- Title: Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
- Authors: Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez
- Abstract summary: Demonstrating feature universality allows discoveries about latent representations to generalize across several models.
We employ a method known as dictionary learning to transform LLM activations into interpretable spaces spanned by neurons corresponding to individual features.
Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
- Score: 14.594698598522797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allows discoveries about latent representations to generalize across several models. However, comparing features across LLMs is challenging due to polysemanticity, in which individual neurons often correspond to multiple features rather than distinct ones. This makes it difficult to disentangle and match features across different models. To address this issue, we employ a method known as dictionary learning by using sparse autoencoders (SAEs) to transform LLM activations into more interpretable spaces spanned by neurons corresponding to individual features. After matching feature neurons across models via activation correlation, we apply representational space similarity metrics like Singular Value Canonical Correlation Analysis to analyze these SAE features across different LLMs. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
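The pipeline the abstract describes (encode activations with SAEs, match feature neurons across models by activation correlation, then compare the matched spaces with SVCCA) can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions (Pearson correlation for matching, mean canonical correlation as the score); the function names and array layouts are not the authors' code.

```python
import numpy as np

def match_features(A, B):
    """Match SAE feature neurons across two models by activation correlation.
    A, B: (n_samples, n_features) SAE activations on the same inputs.
    Returns, for each feature in A, the index of its best match in B."""
    A = (A - A.mean(0)) / (A.std(0) + 1e-8)
    B = (B - B.mean(0)) / (B.std(0) + 1e-8)
    corr = A.T @ B / len(A)          # pairwise Pearson correlations
    return corr.argmax(axis=1)

def svcca(X, Y, keep=0.99):
    """Singular Value CCA: SVD-reduce each space to the components carrying
    `keep` of the variance, then average the canonical correlations between
    the reduced subspaces."""
    def svd_reduce(A):
        A = A - A.mean(0)
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep) + 1
        return U[:, :k] * s[:k]
    Qx, _ = np.linalg.qr(svd_reduce(X))
    Qy, _ = np.linalg.qr(svd_reduce(Y))
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)  # canonical correlations
    return float(rho.mean())
```

Because CCA is invariant to invertible linear maps, two feature spaces that encode the same concepts in different bases score near 1, which is the sense in which a high SVCCA score is evidence for feature universality.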
Related papers
- LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models [16.37602070339033]
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs.
We propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency.
Our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder.
arXiv Detail & Related papers (2025-01-19T13:06:51Z)
- Optimizing Speech Multi-View Feature Fusion through Conditional Computation [51.23624575321469]
Self-supervised learning (SSL) features provide lightweight and versatile multi-view speech representations.
SSL features can conflict with traditional spectral features such as FBanks in terms of their update directions.
We propose a novel generalized feature fusion framework grounded in conditional computation.
arXiv Detail & Related papers (2025-01-14T12:12:06Z)
- On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages [56.22289522687125]
Selective state-space models (SSMs) are an emerging alternative to the Transformer.
We analyze their expressiveness and length generalization performance on regular language tasks.
We introduce the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization.
arXiv Detail & Related papers (2024-12-26T20:53:04Z)
- Verbalized Representation Learning for Interpretable Few-Shot Generalization [130.8173035901391]
Verbalized Representation Learning (VRL) is a novel approach for automatically extracting human-interpretable features for object recognition.
Our method captures inter-class differences and intra-class commonalities in the form of natural language.
VRL achieves a 24% absolute improvement over prior state-of-the-art methods.
arXiv Detail & Related papers (2024-11-27T01:55:08Z)
- MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model [11.91010815015959]
We identify domain-specific neurons in multimodal large language models.
We propose a three-stage mechanism describing how the language-model modules in MLLMs handle projected image features.
arXiv Detail & Related papers (2024-06-17T03:59:44Z)
- Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration [39.35476224845088]
Large language models (LLMs) exhibit complementary strengths in various tasks, motivating the research of LLM ensembling.
We propose a training-free ensemble framework DeePEn, fusing the informative probability distributions yielded by different LLMs at each decoding step.
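The decoding-step fusion at the heart of this approach can be illustrated as a weighted average of next-token distributions. Note that DeePEn's actual contribution is aligning heterogeneous vocabularies via relative representations before fusing; the sketch below assumes the vocabularies already match, and the function name is illustrative.

```python
import numpy as np

def fuse_distributions(dists, weights=None):
    """One ensemble decoding step, heavily simplified: combine the next-token
    distributions of several models by weighted averaging. DeePEn itself first
    maps heterogeneous vocabularies into a shared relative-representation
    space; here a shared vocabulary is assumed."""
    P = np.stack(dists)                       # (n_models, vocab_size)
    w = np.full(len(P), 1.0 / len(P)) if weights is None else np.asarray(weights, dtype=float)
    fused = (w[:, None] * P).sum(axis=0)      # weighted average per token
    return fused / fused.sum()                # renormalize against drift
```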
arXiv Detail & Related papers (2024-04-19T08:52:22Z)
- Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora.
We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs.
Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
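The core of the LAPE score can be sketched as the entropy of a neuron's normalized activation probabilities across languages: a neuron that fires almost exclusively on one language has near-zero entropy. The array layout below is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def lape(act_prob):
    """LAPE sketch: act_prob[i, l] is the probability that neuron i activates
    on text in language l. Normalize each neuron's row into a distribution
    over languages and take its entropy; low entropy marks a
    language-specific neuron."""
    q = act_prob / act_prob.sum(axis=1, keepdims=True)
    return -(q * np.log(q + 1e-12)).sum(axis=1)
```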
arXiv Detail & Related papers (2024-02-26T09:36:05Z)
- Dive into the Chasm: Probing the Gap between In- and Cross-Topic Generalization [66.4659448305396]
This study analyzes various LMs with three probing-based experiments to shed light on the reasons behind the In- vs. Cross-Topic generalization gap.
We demonstrate, for the first time, that generalization gaps and the robustness of the embedding space vary significantly across LMs.
arXiv Detail & Related papers (2024-02-02T12:59:27Z)
- Towards Measuring Representational Similarity of Large Language Models [1.7228514699394508]
We measure the similarity of representations of a set of large language models with 7B parameters.
Our results suggest that some LLMs are substantially different from others.
We identify challenges of using representational similarity measures that suggest the need for careful study of similarity scores to avoid false conclusions.
arXiv Detail & Related papers (2023-12-05T12:48:04Z)
- GBE-MLZSL: A Group Bi-Enhancement Framework for Multi-Label Zero-Shot Learning [24.075034737719776]
This paper investigates a challenging problem of zero-shot learning in the multi-label scenario (MLZSL).
We propose a novel and effective group bi-enhancement framework for MLZSL, dubbed GBE-MLZSL, to fully make use of such properties and enable a more accurate and robust visual-semantic projection.
Experiments on large-scale MLZSL benchmark datasets NUS-WIDE and Open-Images-v4 demonstrate that the proposed GBE-MLZSL outperforms other state-of-the-art methods with large margins.
arXiv Detail & Related papers (2023-09-02T12:07:21Z)
- Finding Neurons in a Haystack: Case Studies with Sparse Probing [2.278231643598956]
Internal computations of large language models (LLMs) remain opaque and poorly understood.
We train $k$-sparse linear classifiers to predict the presence of features in the input.
By varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale.
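A probe of this kind can be approximated by selecting the $k$ neurons most correlated with a binary feature label and fitting a linear classifier on just those. The selection heuristic and plain gradient-descent loop below are illustrative simplifications, not the paper's sparse-probing optimizer.

```python
import numpy as np

def k_sparse_probe(acts, labels, k, lr=0.1, steps=500):
    """Fit a k-sparse linear probe: pick the k neurons whose activations
    correlate most with the label, then train a logistic classifier on
    just those neurons with full-batch gradient descent."""
    z = (acts - acts.mean(0)) / (acts.std(0) + 1e-8)
    y = labels - labels.mean()
    corr = np.abs(z.T @ y) / len(y)
    idx = np.argsort(corr)[-k:]                 # k most label-correlated neurons
    X = z[:, idx]
    w, b = np.zeros(k), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        g = p - labels                          # logistic-loss gradient
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return idx, w, b
```

Sweeping `k` in a loop over such probes is one cheap way to measure how distributed a feature's representation is across neurons.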
arXiv Detail & Related papers (2023-05-02T17:13:55Z)
- Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning.
This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)
- Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
- Sparse Interventions in Language Models with Differentiable Masking [37.220380160016624]
We propose a method that discovers within a neural LM a small subset of neurons responsible for a linguistic phenomenon.
Our experiments confirm that each of these phenomena is mediated through a small subset of neurons.
arXiv Detail & Related papers (2021-12-13T17:49:16Z)
- Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
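The link between forget gates and timescales can be made concrete: a unit whose forget gate averages f retains roughly f**t of its cell state after t steps, giving an exponential-decay timescale of tau = -1/ln(f). The helper below illustrates that standard approximation; it is not code from the paper.

```python
import numpy as np

def lstm_timescale(f):
    """Characteristic timescale of an LSTM unit whose forget gate averages f:
    after t steps the cell retains roughly f**t of its content, so the
    exponential-decay timescale is tau = -1 / ln(f)."""
    f = np.asarray(f, dtype=float)
    return -1.0 / np.log(f)
```

Units with gate values near 1 thus implement very long memories (f = 0.99 gives tau of roughly 99 steps), and a population of units with a spread of gate values can approximate the power-law decay the theory describes.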
arXiv Detail & Related papers (2020-09-27T02:13:38Z)
- Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models incorporated with appropriate training settings may effectively yield universal representation.
arXiv Detail & Related papers (2020-09-10T03:53:18Z) - Learning Mixtures of Random Utility Models with Features from Incomplete
Preferences [34.50516583809234]
We consider RUMs with features and their mixtures, where each alternative has a vector of features, possibly different across agents.
We extend mixtures of RUMs with features to models that generate incomplete preferences and characterize their identifiability.
Our experiments on synthetic data demonstrate the effectiveness of MLE on PL with features with tradeoffs between statistical efficiency and computational efficiency.
arXiv Detail & Related papers (2020-06-06T13:47:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.