Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
- URL: http://arxiv.org/abs/2601.02978v1
- Date: Tue, 06 Jan 2026 12:40:37 GMT
- Title: Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
- Authors: Ruikang Zhang, Shuo Wang, Qi Su
- Abstract summary: We propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior.
- Score: 8.188989044347595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work in Mechanistic Interpretability (MI) has enabled the identification of, and intervention on, internal features in Large Language Models (LLMs). However, a persistent challenge lies in linking such internal features to the reliable control of complex, behavior-level semantic attributes in language generation. In this paper, we propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-level linguistic behaviors. Our method employs a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis and generation-based validation to distill monosemantic functional features from sparse activation spaces. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior while maintaining superior stability and performance compared to existing activation steering methods such as Contrastive Activation Addition (CAA). We further identify an empirical effect, which we term Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute. Our findings suggest that LLMs internalize deeply integrated representations of high-order concepts, and they provide a novel, robust mechanistic path for the regulation of complex AI behaviors.
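To make the recipe concrete, here is a minimal sketch of the two stages the abstract describes: contrastive retrieval of candidate SAE features from paired trait-positive/trait-negative prompt activations, followed by bidirectional steering along a feature's decoder direction. This is an illustration under stated assumptions, not the authors' released code: the toy SparseAutoencoder, the layer index, and the steering strength alpha are all invented stand-ins, and `model` stands for a generic HuggingFace-style causal LM.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy ReLU SAE: f(x) = ReLU((x - b_dec) @ W_enc + b_enc); x_hat = f @ W_dec + b_dec."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # x: [..., d_model] -> sparse feature activations [..., n_features]
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)


@torch.no_grad()
def contrastive_feature_scores(sae, acts_pos, acts_neg):
    # Stage 1 (retrieval): mean feature activation on trait-positive prompts
    # minus trait-negative ones; high-|score| features become candidates for
    # the generation-based validation step the abstract mentions.
    return sae.encode(acts_pos).mean(dim=0) - sae.encode(acts_neg).mean(dim=0)


def make_steering_hook(sae, feature_idx: int, alpha: float):
    # Stage 2 (steering): add (alpha > 0) or subtract (alpha < 0) the unit
    # decoder direction of one feature at every position -> bidirectional control.
    direction = sae.W_dec[feature_idx].detach()
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return hook
```

A hypothetical usage, with random tensors standing in for residual-stream activations collected on the two prompt sets:

```python
sae = SparseAutoencoder(d_model=2048, n_features=16384)
acts_pos, acts_neg = torch.randn(512, 2048), torch.randn(512, 2048)  # stand-in activations
top = contrastive_feature_scores(sae, acts_pos, acts_neg).abs().argmax().item()
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(sae, top, 8.0))
```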
Related papers
- Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features [1.5874067490843806]
Control Reinforcement Learning trains a policy to select SAE features for steering at each token, producing interpretable intervention logs. Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. On Gemma 2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs.
arXiv Detail & Related papers (2026-02-11T02:28:49Z) - Mechanistic Indicators of Steering Effectiveness in Large Language Models [3.635648354808971]
Activation-based steering enables Large Language Models to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood. We investigate whether the reliability of steering can be diagnosed using internal model signals.
arXiv Detail & Related papers (2026-02-02T06:56:22Z) - Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs [69.28193153685893]
Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanisms to create more robust and efficient models.
arXiv Detail & Related papers (2026-01-16T06:29:07Z) - Stochastic Encodings for Active Feature Acquisition [100.47043816019888]
Active Feature Acquisition is an instance-wise, sequential decision-making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which suffers from training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which makes acquisitions myopic. We introduce a latent variable model, trained in a supervised manner; acquisitions are made by reasoning about the features across many possible unobserved realizations in a latent space.
arXiv Detail & Related papers (2025-08-03T23:48:46Z) - InverseScope: Scalable Activation Inversion for Interpreting Large Language Models [5.670123459649656]
InverseScope is an assumption-light and scalable framework for interpreting neural activations via input inversion. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture. We also introduce a quantitative evaluation protocol that tests interpretability hypotheses using the feature consistency rate computed over the sampled inputs.
arXiv Detail & Related papers (2025-06-09T03:59:28Z) - LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation [102.1527101235251]
LangTraj is a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors. LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation.
arXiv Detail & Related papers (2025-04-15T17:14:06Z) - The Complexity of Learning Sparse Superposed Features with Feedback [2.4140387101794283]
We investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios.
arXiv Detail & Related papers (2025-02-08T01:54:23Z) - LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models [16.37602070339033]
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. We propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency. Our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder; a toy version of this kind of feature-space edit is sketched after this list.
arXiv Detail & Related papers (2025-01-19T13:06:51Z) - LatentQA: Teaching LLMs to Decode Activations Into Natural Language [72.87064562349742]
We introduce LatentQA, the task of answering open-ended questions about model activations in natural language. We propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations.
arXiv Detail & Related papers (2024-12-11T18:59:33Z) - Guiding the PLMs with Semantic Anchors as Intermediate Supervision: Towards Interpretable Semantic Parsing [57.11806632758607]
We propose to augment current pretrained language models with a hierarchical decoder network.
By taking the first-principle structures as the semantic anchors, we propose two novel intermediate supervision tasks.
We conduct extensive experiments on several semantic parsing benchmarks and demonstrate that our approach consistently outperforms the baselines.
arXiv Detail & Related papers (2022-10-04T07:27:29Z) - Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore their latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
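Several entries above (LF-Steering, and the token-level steering in the CRL paper) describe interventions applied in the SAE feature space rather than directly in the residual stream. As a rough, hedged illustration only, the sketch below reuses the toy SparseAutoencoder defined under the abstract: it pins a single feature to an invented target_value at every token, decodes back to hidden states, and keeps the kind of per-token intervention log the CRL entry mentions. None of the names or magnitudes come from the papers themselves.

```python
import torch


@torch.no_grad()
def steer_in_feature_space(sae, hidden, feature_idx, target_value, log):
    # hidden: [seq_len, d_model]. Encode into the SAE feature space, pin one
    # feature to target_value at every token, decode back, and record each edit.
    feats = sae.encode(hidden)                       # [seq_len, n_features]
    old = feats[:, feature_idx].clone()
    feats[:, feature_idx] = target_value
    for t in range(hidden.shape[0]):
        log.append({"token": t, "feature": feature_idx,
                    "old": float(old[t]), "new": float(target_value)})
    return feats @ sae.W_dec + sae.b_dec             # edited hidden states


# Hypothetical usage, with `sae` and `top` as in the earlier sketch:
# log = []
# edited = steer_in_feature_space(sae, torch.randn(16, 2048), top, 5.0, log)
```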