Step-Level Sparse Autoencoder for Reasoning Process Interpretation
- URL: http://arxiv.org/abs/2603.03031v1
- Date: Tue, 03 Mar 2026 14:25:02 GMT
- Title: Step-Level Sparse Autoencoder for Reasoning Process Interpretation
- Authors: Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao,
- Abstract summary: Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. We propose the step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features.
- Score: 48.99201531966593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose the step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. By linear probing, we can easily predict surface-level information, such as generation length and first-token distribution, as well as more complicated properties, such as the correctness and logicality of the step. These observations indicate that LLMs should already at least partly know about these properties during generation, which provides a foundation for the self-verification ability of LLMs. The code is available at https://github.com/Miaow-Lab/SSAE
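The sparsity bottleneck described in the abstract can be sketched at toy scale as a top-k sparse autoencoder over a pooled step embedding. Everything below (layer sizes, the hard top-k rule, random untrained weights, the single-vector step representation) is an illustrative assumption, not the paper's actual SSAE architecture or its context-conditioning scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_sparse_code(x, W_enc, b_enc, k):
    """Encode x and keep only the k largest activations (hard sparsity)."""
    z = np.maximum(W_enc @ x + b_enc, 0.0)  # ReLU pre-codes
    if k < z.size:
        idx = np.argsort(z)[:-k]            # indices of all but the k largest
        z[idx] = 0.0                        # zero them out: the bottleneck
    return z

def reconstruct(z, W_dec, b_dec):
    """Decode the sparse code back into the step-embedding space."""
    return W_dec @ z + b_dec

d, m, k = 8, 32, 4                          # input dim, dictionary size, sparsity
W_enc = rng.standard_normal((m, d)) * 0.1
b_enc = np.zeros(m)
W_dec = rng.standard_normal((d, m)) * 0.1
b_dec = np.zeros(d)

x = rng.standard_normal(d)                  # stand-in for a pooled step embedding
z = topk_sparse_code(x, W_enc, b_enc, k)
x_hat = reconstruct(z, W_dec, b_dec)

print(int(np.count_nonzero(z)))             # at most k active features
```

In the actual method, the bottleneck is additionally conditioned on the step's context, so that background information is reconstructed from context and only the step's incremental information has to pass through the few active dimensions; those sparse features are then what the linear probes read properties such as correctness from.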
Related papers
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs [100.02824137397464]
We investigate how Large Language Models adapt their internal representations when encountering inputs of increasing difficulty. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, the last hidden states of LLMs become substantially sparser. This sparsity-difficulty relation is observable across diverse models and domains.
arXiv Detail & Related papers (2026-03-03T18:48:15Z)
- Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features [1.5874067490843806]
Control Reinforcement Learning (CRL) trains a policy to select SAE features for steering at each token, producing interpretable intervention logs. Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. On Gemma 2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs.
arXiv Detail & Related papers (2026-02-11T02:28:49Z)
- Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
- Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures [12.466522376751811]
Hyperdimensional Probe is a novel paradigm for decoding information from the vector space of Large Language Models. It combines ideas from symbolic representations and neural probing to project the model's residual stream into interpretable concepts. Our work advances information decoding in LLM vector space, enabling the extraction of more informative, interpretable, and structured features from neural representations.
arXiv Detail & Related papers (2025-09-29T16:59:07Z)
- Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models. We identify the presence of activation outliers, characterized by abnormally large activation values. We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
arXiv Detail & Related papers (2025-08-20T17:59:51Z)
- Feature Engineering for Agents: An Adaptive Cognitive Architecture for Interpretable ML Monitoring [2.1205272468688574]
We propose a cognitive architecture for ML monitoring that applies feature engineering principles to agents based on Large Language Models. A Decision Procedure module simulates feature engineering through three key steps: Refactor, Break Down, and Compile. Experiments using multiple LLMs demonstrate the efficacy of our approach, achieving significantly higher accuracy compared to various baselines.
arXiv Detail & Related papers (2025-06-11T13:48:25Z)
- FOL-Pretrain: A complexity annotated corpus of first-order logic [16.061040115094592]
Transformer-based large language models (LLMs) have demonstrated remarkable reasoning capabilities. Despite recent efforts to reverse-engineer LLM behavior, our understanding of how these models internalize and execute complex algorithms remains limited. We introduce a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces.
arXiv Detail & Related papers (2025-05-20T21:38:28Z)
- Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions [60.43398881149664]
We introduce LOS-Net, a lightweight attention-based architecture trained on an efficient encoding of the LLM Output Signature. It achieves superior performance across diverse benchmarks and LLMs, while maintaining extremely low detection latency.
arXiv Detail & Related papers (2025-03-18T09:04:37Z)
- Enhancing LLM Character-Level Manipulation via Divide and Conquer [74.55804812450164]
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. We propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation.
arXiv Detail & Related papers (2025-02-12T07:37:39Z)
- MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.