Related papers: Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models

URL: http://arxiv.org/abs/2410.06981v2
Date: Fri, 31 Jan 2025 15:27:10 GMT
Title: Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Authors: Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez,
Abstract summary: Demonstrating feature universality allows discoveries about latent representations to generalize across several models. We employ a method known as dictionary learning to transform LLM activations into more interpretable spaces spanned by neurons corresponding to individual features. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
Score: 14.594698598522797
License:
Abstract: We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allows discoveries about latent representations to generalize across several models. However, comparing features across LLMs is challenging due to polysemanticity, in which individual neurons often correspond to multiple features rather than distinct ones, making it difficult to disentangle and match features across different models. To address this issue, we employ a method known as dictionary learning by using sparse autoencoders (SAEs) to transform LLM activations into more interpretable spaces spanned by neurons corresponding to individual features. After matching feature neurons across models via activation correlation, we apply representational space similarity metrics on SAE feature spaces across different LLMs. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.

Related papers

LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models [16.37602070339033]
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. We propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency. Our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder.
arXiv Detail & Related papers (2025-01-19T13:06:51Z)
Optimizing Speech Multi-View Feature Fusion through Conditional Computation [51.23624575321469]
Self-supervised learning (SSL) features provide lightweight and versatile multi-view speech representations. SSL features conflict with traditional spectral features like FBanks in terms of update directions. We propose a novel generalized feature fusion framework grounded in conditional computation.
arXiv Detail & Related papers (2025-01-14T12:12:06Z)
On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages [56.22289522687125]
Selective state-space models (SSMs) are an emerging alternative to the Transformer. We analyze their expressiveness and length generalization performance on regular language tasks. We introduce the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization.
arXiv Detail & Related papers (2024-12-26T20:53:04Z)
Verbalized Representation Learning for Interpretable Few-Shot Generalization [130.8173035901391]
Verbalized Representation Learning (VRL) is a novel approach for automatically extracting human-interpretable features for object recognition. Our method captures inter-class differences and intra-class commonalities in the form of natural language. VRL achieves a 24% absolute improvement over prior state-of-the-art methods.
arXiv Detail & Related papers (2024-11-27T01:55:08Z)
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model [11.91010815015959]
We identify domain-specific neurons in multimodal large language models. We propose a three-stage mechanism for language model modules in MLLMs when handling projected image features.
arXiv Detail & Related papers (2024-06-17T03:59:44Z)
Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration [39.35476224845088]
Large language models (LLMs) exhibit complementary strengths in various tasks, motivating the research of LLM ensembling. We propose a training-free ensemble framework DeePEn, fusing the informative probability distributions yielded by different LLMs at each decoding step.
arXiv Detail & Related papers (2024-04-19T08:52:22Z)
Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora. We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs. Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
arXiv Detail & Related papers (2024-02-26T09:36:05Z)
GBE-MLZSL: A Group Bi-Enhancement Framework for Multi-Label Zero-Shot Learning [24.075034737719776]
This paper investigates a challenging problem of zero-shot learning in the multi-label scenario (MLZSL) We propose a novel and effective group bi-enhancement framework for MLZSL, dubbed GBE-MLZSL, to fully make use of such properties and enable a more accurate and robust visual-semantic projection. Experiments on large-scale MLZSL benchmark datasets NUS-WIDE and Open-Images-v4 demonstrate that the proposed GBE-MLZSL outperforms other state-of-the-art methods with large margins.
arXiv Detail & Related papers (2023-09-02T12:07:21Z)
Finding Neurons in a Haystack: Case Studies with Sparse Probing [2.278231643598956]
Internal computations of large language models (LLMs) remain opaque and poorly understood. We train $k$-sparse linear classifiers to predict the presence of features in the input. By varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale.
arXiv Detail & Related papers (2023-05-02T17:13:55Z)
Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)
Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores the universal representation learning, i.e., embeddings of different levels of linguistic unit in a uniform vector space. We present our approach of constructing analogy datasets in terms of words, phrases and sentences. We empirically verify that well pre-trained Transformer models incorporated with appropriate training settings may effectively yield universal representation.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.