Related papers: Measuring and Guiding Monosemanticity

Measuring and Guiding Monosemanticity

URL: http://arxiv.org/abs/2506.19382v1
Date: Tue, 24 Jun 2025 07:18:20 GMT
Title: Measuring and Guiding Monosemanticity
Authors: Ruben Härle, Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting,
Abstract summary: Current methods face challenges in reliably localizing and manipulating feature representations.<n>We propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training.<n>G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation.
Score: 25.40936584258291
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representation. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.

Related papers

Unified modality separation: A vision-language framework for unsupervised domain adaptation [60.8391821117794]
Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains.<n>We propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components.<n>Our methods achieve up to 9% performance gain with 9 times of computational efficiencies.
arXiv Detail & Related papers (2025-08-07T02:51:10Z)
ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs [50.18087419133284]
hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations.<n>We introduce a novel metric, the ICR Score, which quantifies the contribution of modules to the hidden states' update.<n>We propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states.
arXiv Detail & Related papers (2025-07-22T11:44:26Z)
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization [17.101290138120564]
Current methods rely on dictionary learning with sparse autoencoders (SAEs)<n>Here, we tackle these limitations by directly decomposing activations with semi-nonnegative matrix factorization (SNMF)<n>Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering.
arXiv Detail & Related papers (2025-06-12T17:33:29Z)
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models [50.587868616659826]
We introduce a comprehensive framework for evaluating monosemanticity at the neuron-level in vision representations.<n>Our experimental results reveal that SAEs trained on Vision-Language Models significantly enhance the monosemanticity of individual neurons.
arXiv Detail & Related papers (2025-04-03T17:58:35Z)
Towards LLM Guardrails via Sparse Representation Steering [11.710399901426873]
Large Language Models (LLMs) have demonstrated remarkable performance in natural language generation tasks.<n>We propose a sparse encoding-based representation engineering method, named SRE, which decomposes polysemantic activations into a structured, monosemantic feature space.<n>By leveraging sparse autoencoding, our approach isolates and adjusts only task-specific sparse feature dimensions, enabling precise and interpretable steering of model behavior.
arXiv Detail & Related papers (2025-03-21T04:50:25Z)
SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts [0.6291443816903801]
This paper introduces a novel framework designed to autonomously evaluate the robustness of large language models (LLMs)<n>Our method generates descriptive sentences from domain-constrained knowledge graph triplets to formulate adversarial prompts.<n>This self-evaluation mechanism allows the LLM to evaluate its robustness without the need for external benchmarks.
arXiv Detail & Related papers (2024-12-01T10:58:53Z)
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control [44.326363467045496]
Large Language Models (LLMs) have become a critical area of research in Reinforcement Learning from Human Feedback (RLHF) representation engineering offers a new, training-free approach. This technique leverages semantic features to control the representation of LLM's intermediate hidden states. It is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature.
arXiv Detail & Related papers (2024-11-04T08:36:03Z)
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness [68.69369585600698]
Deep learning models often suffer from a lack of interpretability due to polysemanticity. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability. We show that monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
arXiv Detail & Related papers (2024-10-27T18:03:20Z)
Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks. We propose an innovative textitmetacognitive approach, dubbed textbfCLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z)
Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. The enigmatic black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z)
Understanding Self-attention Mechanism via Dynamical System Perspective [58.024376086269015]
Self-attention mechanism (SAM) is widely used in various fields of artificial intelligence. We show that intrinsic stiffness phenomenon (SP) in the high-precision solution of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NN) We show that the SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP.
arXiv Detail & Related papers (2023-08-19T08:17:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.