Related papers: Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

URL: http://arxiv.org/abs/2410.00153v1
Date: Mon, 30 Sep 2024 18:52:53 GMT
Title: Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Authors: Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du,
Abstract summary: We propose an approach to approximate the subspace representing a specific concept. We demonstrate GCS's effectiveness through measuring its faithfulness and plausibility across multiple large language models. We also use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering.
Score: 23.594013836364628
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Probing learned concepts in large language models (LLMs) is crucial for understanding how semantic knowledge is encoded internally. Training linear classifiers on probing tasks is a principle approach to denote the vector of a certain concept in the representation space. However, the single vector identified for a concept varies with both data and training, making it less robust and weakening its effectiveness in real-world applications. To address this challenge, we propose an approach to approximate the subspace representing a specific concept. Built on linear probing classifiers, we extend the concept vectors into Gaussian Concept Subspace (GCS). We demonstrate GCS's effectiveness through measuring its faithfulness and plausibility across multiple LLMs with different sizes and architectures. Additionally, we use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering. Experimental results indicate that GCS concept vectors have the potential to balance steering performance and maintaining the fluency in natural language generation tasks.

Related papers

Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? [73.80382983108997]
Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models.<n>If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and the out-of-distribution jailbreaks.<n>We propose Concept Concentration (COCA), which simplifies the decision boundary between harmful and benign representations.
arXiv Detail & Related papers (2025-05-24T12:23:52Z)
Toward a Flexible Framework for Linear Representation Hypothesis Using Maximum Likelihood Estimation [3.515066520628763]
We introduce a new notion of binary concepts as unit vectors in a canonical representation space. Our method, Sum of Activation-base Normalized Difference (SAND), formalizes the use of activation differences modeled as samples from a von Mises-Fisher distribution.
arXiv Detail & Related papers (2025-02-22T23:56:30Z)
Generalizable Sensor-Based Activity Recognition via Categorical Concept Invariant Learning [5.920971285288677]
Human Activity Recognition (HAR) aims to recognize activities by training models on massive sensor data. One crucial aspect of HAR that has been largely overlooked is that the test sets may have different distributions from training sets. We propose a Categorical Concept Invariant Learning framework for generalizable activity recognition.
arXiv Detail & Related papers (2024-12-18T08:18:03Z)
Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning [53.685764040547625]
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. This work provides a fine mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities.
arXiv Detail & Related papers (2024-11-04T15:54:32Z)
Contrastive Learning and Abstract Concepts: The Case of Natural Numbers [0.0]
We show that contrastive learning can be trained to count at a glance with high accuracy both at human as well as at super-human ranges. We compare this with the results of a trained-to-count at a glance supervised learning (SL) neural network scheme of similar architecture.
arXiv Detail & Related papers (2024-08-05T05:41:16Z)
Local Concept Embeddings for Analysis of Concept Distributions in DNN Feature Spaces [1.0923877073891446]
We propose a novel concept analysis framework for deep neural networks (DNNs) Instead of optimizing a single global concept vector on the complete dataset, it generates a local concept embedding (LoCE) vector for each individual sample. Despite its context sensitivity, our method's concept segmentation performance is competitive to global baselines.
arXiv Detail & Related papers (2023-11-24T12:22:00Z)
Identifying Linear Relational Concepts in Large Language Models [16.917379272022064]
Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations. We present a technique called linear relational concepts (LRC) for finding concept directions corresponding to human-interpretable concepts.
arXiv Detail & Related papers (2023-11-15T14:01:41Z)
Uncovering Unique Concept Vectors through Latent Space Decomposition [0.0]
Concept-based explanations have emerged as a superior approach that is more interpretable than feature attribution estimates. We propose a novel post-hoc unsupervised method that automatically uncovers the concepts learned by deep models during training. Our experiments reveal that the majority of our concepts are readily understandable to humans, exhibit coherency, and bear relevance to the task at hand.
arXiv Detail & Related papers (2023-07-13T17:21:54Z)
Concept-Centric Transformers: Enhancing Model Interpretability through Object-Centric Concept Learning within a Shared Global Workspace [1.6574413179773757]
Concept-Centric Transformers is a simple yet effective configuration of the shared global workspace for interpretability. We show that our model achieves better classification accuracy than all baselines across all problems.
arXiv Detail & Related papers (2023-05-25T06:37:39Z)
Concept Gradient: Concept-based Interpretation Without Linear Assumption [77.96338722483226]
Concept Activation Vector (CAV) relies on learning a linear relation between some latent representation of a given model and concepts. We proposed Concept Gradient (CG), extending concept-based interpretation beyond linear concept functions. We demonstrated CG outperforms CAV in both toy examples and real world datasets.
arXiv Detail & Related papers (2022-08-31T17:06:46Z)
Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization [76.68866368409216]
We propose learning to dynamically select discretization tightness conditioned on inputs. We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.
arXiv Detail & Related papers (2022-02-02T23:54:26Z)
Cross-Modal Discrete Representation Learning [73.68393416984618]
We present a self-supervised learning framework that learns a representation that captures finer levels of granularity across different modalities. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities.
arXiv Detail & Related papers (2021-06-10T00:23:33Z)
Concept Learners for Few-Shot Learning [76.08585517480807]
We propose COMET, a meta-learning method that improves generalization ability by learning to learn along human-interpretable concept dimensions. We evaluate our model on few-shot tasks from diverse domains, including fine-grained image classification, document categorization and cell type annotation.
arXiv Detail & Related papers (2020-07-14T22:04:17Z)
Closed-Form Factorization of Latent Semantics in GANs [65.42778970898534]
A rich set of interpretable dimensions has been shown to emerge in the latent space of the Generative Adversarial Networks (GANs) trained for synthesizing images. In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner. We propose a closed-form factorization algorithm for latent semantic discovery by directly decomposing the pre-trained weights.
arXiv Detail & Related papers (2020-07-13T18:05:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.