Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features
- URL: http://arxiv.org/abs/2602.10437v2
- Date: Thu, 12 Feb 2026 02:43:40 GMT
- Title: Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features
- Authors: Seonglae Cho, Zekun Wu, Adriano Koshiyama
- Abstract summary: Control Reinforcement Learning trains a policy to select SAE features for steering at each token, producing interpretable intervention logs. Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. On Gemma 2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs.
- Score: 1.5874067490843806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Reinforcement Learning (CRL), which trains a policy to select SAE features for steering at each token, producing interpretable intervention logs: the learned policy identifies features that change model outputs when amplified. Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. The framework yields new analysis capabilities: branch point tracking locates tokens where feature choice determines output correctness; critic trajectory analysis separates policy limitations from value estimation errors; layer-wise comparison reveals syntactic features in early layers and semantic features in later layers. On Gemma 2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs. These results establish learned feature steering as a mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes.
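The per-token intervention loop described in the abstract can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: random weights stand in for a trained SAE, the dimensions are toy values (the paper uses Gemma 2 2B), and a greedy argmax heuristic stands in for the learned RL policy. The point is only to illustrate the shape of the mechanism: encode the activation into sparse features, pick one feature per token, amplify it by adding its decoder direction back into the residual stream, and record the choice in an intervention log.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real SAEs on Gemma 2 2B are far larger).
d_model, n_features = 16, 64

# Random weights stand in for a trained sparse autoencoder.
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))

def sae_encode(h):
    """ReLU encoder: activation vector -> sparse feature activations."""
    return np.maximum(h @ W_enc, 0.0)

def steer(h, feature_idx, alpha=4.0):
    """Amplify one SAE feature by adding its decoder direction
    back into the residual stream (additive steering)."""
    f = sae_encode(h)
    return h + alpha * f[feature_idx] * W_dec[feature_idx]

def policy_select(h):
    """Stand-in policy: pick the most active feature at this token.
    CRL instead *learns* this selection with reinforcement learning."""
    return int(np.argmax(sae_encode(h)))

# Per-token intervention loop producing an interpretable log.
tokens = ["The", "answer", "is"]
log = []
for t, tok in enumerate(tokens):
    h = rng.normal(size=d_model)   # activation at this token position
    idx = policy_select(h)         # feature chosen for steering
    h_steered = steer(h, idx)      # intervened activation
    log.append((t, tok, idx))      # (position, token, feature id)

for entry in log:
    print(entry)
```

The log of `(position, token, feature id)` triples is what makes the intervention interpretable: each steering decision can be traced back to a single named SAE feature at a single token.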
Related papers
- Step-Level Sparse Autoencoder for Reasoning Process Interpretation [48.99201531966593]
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. We propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features.
arXiv Detail & Related papers (2026-03-03T14:25:02Z) - Explaining AutoClustering: Uncovering Meta-Feature Contribution in AutoML for Clustering [0.6487259764989486]
AutoClustering methods often leverage meta-learning over dataset meta-features. This limits reliability, bias diagnostics, and efficient meta-feature engineering. This study offers a practical foundation for increasing decision transparency in unsupervised learning automation.
arXiv Detail & Related papers (2026-02-20T17:01:25Z) - Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders [8.188989044347595]
We propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior.
arXiv Detail & Related papers (2026-01-06T12:40:37Z) - Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs [49.66344956133349]
Reasoning capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a latent variable for strategic contextualization.
arXiv Detail & Related papers (2025-12-19T03:32:53Z) - SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks [0.0]
We present SALVE, a framework that bridges mechanistic interpretability and model editing. We learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method.
arXiv Detail & Related papers (2025-12-17T20:06:03Z) - ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning [51.133569963553576]
ssToken is a Self-modulated and Semantic-aware Token Selection approach. We show that both self-modulated selection and semantic-aware selection alone outperform full-data fine-tuning.
arXiv Detail & Related papers (2025-10-21T03:21:04Z) - Stochastic Encodings for Active Feature Acquisition [100.47043816019888]
Active Feature Acquisition is an instance-wise, sequential decision making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which is myopic. We introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a latent space.
arXiv Detail & Related papers (2025-08-03T23:48:46Z) - Provable In-Context Learning of Nonlinear Regression with Transformers [66.99048542127768]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters. Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks. This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z) - GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z) - FADE: Why Bad Descriptions Happen to Good Features [14.00042287629001]
We introduce FADE: Feature Alignment to Description Evaluation. FADE is a scalable framework for automatically evaluating feature-to-description alignment. We apply FADE to analyze existing open-source feature descriptions and assess key components of automated interpretability pipelines.
arXiv Detail & Related papers (2025-02-24T09:28:35Z) - Analyze Feature Flow to Enhance Interpretation and Steering in Language Models [3.8498574327875947]
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage.
arXiv Detail & Related papers (2025-02-05T09:39:34Z) - X2-DFD: A framework for eXplainable and eXtendable Deepfake Detection [55.77552681618732]
X2-DFD is an eXplainable and eXtendable framework based on multimodal large-language models (MLLMs) for deepfake detection. The first stage, Model Feature Assessment, systematically evaluates the detectability of forgery-related features for the MLLM. The second stage, Explainable Dataset Construction, consists of two key modules: Strong Feature Strengthening and Weak Feature Supplementing. The third stage, Fine-tuning and Inference, involves fine-tuning the MLLM on the constructed dataset and deploying it for final detection and explanation.
arXiv Detail & Related papers (2024-10-08T15:28:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.