Toward a Flexible Framework for Linear Representation Hypothesis Using Maximum Likelihood Estimation
- URL: http://arxiv.org/abs/2502.16385v1
- Date: Sat, 22 Feb 2025 23:56:30 GMT
- Title: Toward a Flexible Framework for Linear Representation Hypothesis Using Maximum Likelihood Estimation
- Authors: Trung Nguyen, Yan Leng
- Abstract summary: We introduce a new notion of binary concepts as unit vectors in a canonical representation space. Our method, Sum of Activation-base Normalized Difference (SAND), formalizes the use of activation differences modeled as samples from a von Mises-Fisher distribution.
- Score: 3.515066520628763
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The linear representation hypothesis posits that high-level concepts are encoded as linear directions in the representation spaces of LLMs. Park et al. (2024) formalize this notion by unifying multiple interpretations of linear representation, such as 1-dimensional subspace representations and interventions, under a causal inner product. However, their framework relies on single-token counterfactual pairs and cannot handle ambiguous contrasting pairs, limiting its applicability to complex or context-dependent concepts. We introduce a new notion of binary concepts as unit vectors in a canonical representation space, and use LLMs' (neural) activation differences together with maximum likelihood estimation (MLE) to compute concept directions (i.e., steering vectors). Our method, Sum of Activation-base Normalized Difference (SAND), formalizes the use of activation differences modeled as samples from a von Mises-Fisher (vMF) distribution, providing a principled approach to deriving concept directions. We extend the applicability of Park et al. (2024) by removing the dependency on unembedding representations and single-token pairs. Through experiments with LLaMA models across diverse concepts and benchmarks, we demonstrate that our lightweight approach offers greater flexibility and superior performance in activation engineering tasks such as monitoring and manipulation.
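The abstract's central computation can be made concrete: for a vMF distribution, the maximum likelihood estimate of the mean direction of a set of unit vectors is simply their normalized sum, so a SAND-style steering vector amounts to normalizing each activation difference and then normalizing the sum. The sketch below is a minimal illustration under that reading of the abstract; the function name, array shapes, and usage are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def sand_direction(pos_acts: np.ndarray, neg_acts: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Estimate a concept (steering) direction from paired activations.

    pos_acts, neg_acts: arrays of shape (n_pairs, hidden_dim) with hidden
    activations for contrasting prompts. Each row-wise difference is treated
    as a sample from a von Mises-Fisher distribution on the unit sphere; the
    MLE of the vMF mean direction is the normalized sum of the unit samples.
    """
    diffs = pos_acts - neg_acts                                                # activation differences
    unit_diffs = diffs / (np.linalg.norm(diffs, axis=1, keepdims=True) + eps)  # project each difference to the unit sphere
    summed = unit_diffs.sum(axis=0)                                            # sum of unit vectors
    return summed / (np.linalg.norm(summed) + eps)                             # normalized sum = vMF MLE of the mean direction

# Illustrative usage with random stand-in activations (hidden_dim = 4096, as in LLaMA-7B).
rng = np.random.default_rng(0)
pos = rng.normal(size=(32, 4096))
neg = rng.normal(size=(32, 4096))
direction = sand_direction(pos, neg)  # unit-norm concept direction
```

On this reading, manipulation would add a scaled copy of the unit direction to hidden activations at inference time, and monitoring would project activations onto it; the scaling and layer choice are left open here.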
Related papers
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [79.01538178959726]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence.
We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z) - The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence [57.57786477441956]
Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request.
We propose a novel gradient-based approach to representation engineering and use it to identify refusal directions.
We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions.
arXiv Detail & Related papers (2025-02-24T18:52:59Z) - Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization [2.163881720692685]
We introduce a new methodology for incorporating interpretability and intervenability into an existing model by integrating Concept Layers into its architecture.
Our approach projects the model's internal vector representations into a conceptual, explainable vector space before reconstructing and feeding them back into the model.
We evaluate CLs across multiple tasks, demonstrating that they maintain the original model's performance and agreement while enabling meaningful interventions.
arXiv Detail & Related papers (2025-02-19T11:10:19Z) - Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts [24.990218468155383]
This work explores the intricate relationships between concept representations across different Large Language Models (LLMs). We introduce a linear transformation method to bridge these representations, drawing an intriguing parallel to Plato's Allegory of the Cave. A weak-to-strong transferability exists between LLM concept representations, whereby steering vectors (SVs) extracted from smaller LLMs can effectively control the behavior of larger LLMs.
arXiv Detail & Related papers (2025-01-02T11:56:59Z) - Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning [53.685764040547625]
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities.
This work provides a fine mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities.
arXiv Detail & Related papers (2024-11-04T15:54:32Z) - Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution [23.594013836364628]
We propose an approach to approximate the subspace representing a specific concept.
We demonstrate GCS's effectiveness through measuring its faithfulness and plausibility across multiple large language models.
We also use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering.
arXiv Detail & Related papers (2024-09-30T18:52:53Z) - Identifying Linear Relational Concepts in Large Language Models [16.917379272022064]
Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations.
We present a technique called linear relational concepts (LRC) for finding concept directions corresponding to human-interpretable concepts.
arXiv Detail & Related papers (2023-11-15T14:01:41Z) - Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning [79.83792914684985]
We prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations.
Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem.
arXiv Detail & Related papers (2022-11-26T21:02:09Z) - Making Linear MDPs Practical via Contrastive Representation Learning [101.75885788118131]
It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations.
We consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning.
We demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.
arXiv Detail & Related papers (2022-07-14T18:18:02Z) - Closed-Form Factorization of Latent Semantics in GANs [65.42778970898534]
A rich set of interpretable dimensions has been shown to emerge in the latent space of Generative Adversarial Networks (GANs) trained for synthesizing images.
In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner.
We propose a closed-form factorization algorithm for latent semantic discovery by directly decomposing the pre-trained weights.
arXiv Detail & Related papers (2020-07-13T18:05:36Z)
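As a rough illustration of the closed-form factorization idea in the last entry: the candidate semantic directions are the top eigenvectors of A^T A, where A is the weight matrix of the first transformation applied to the latent code, which is equivalent to taking the top right singular vectors of A. The sketch below assumes such a weight matrix is available as a NumPy array; it is an interpretation of the summary above, not the authors' released implementation.

```python
import numpy as np

def closed_form_directions(A: np.ndarray, k: int = 5) -> np.ndarray:
    """Closed-form latent semantic discovery (sketch).

    A: weight matrix of the generator's first layer acting on the latent code,
       shape (out_dim, latent_dim). Candidate semantic directions are the top-k
       eigenvectors of A^T A, i.e. the top-k right singular vectors of A.
    Returns an array of shape (k, latent_dim) with unit-norm directions.
    """
    # Right singular vectors of A maximize ||A n|| subject to ||n|| = 1.
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    return vt[:k]

# Illustrative usage with a random stand-in weight matrix.
A = np.random.default_rng(0).normal(size=(512, 128))
directions = closed_form_directions(A, k=3)  # edit latent codes as z + alpha * directions[i]
```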