Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
- URL: http://arxiv.org/abs/2601.14004v3
- Date: Mon, 26 Jan 2026 09:33:08 GMT
- Title: Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
- Authors: Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui, Ruobing Xie, Hayden Kwok-Hay So, Hinrich Schütze, Xuanjing Huang, Qi Zhang, Ngai Wong
- Abstract summary: Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). We present a practical survey structured around the pipeline: "Locate, Steer, and Improve."
- Score: 122.58252919699122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.
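The "Locate, Steer" portion of the pipeline described in the abstract can be illustrated with a minimal toy sketch: locate a concept direction in activation space (here via difference-of-means, one common localization choice among many the survey covers), then steer by shifting a hidden activation along that direction. All names (`steer`, `locate`) and the 4-dimensional toy activations below are illustrative assumptions, not the survey's own implementation.

```python
import random

# Toy sketch of "Locate, Steer" on synthetic 4-d activations.
# This is an illustrative assumption, not code from the survey.
random.seed(0)
DIM = 4

def randvec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scale(a, s):
    return [x * s for x in a]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(DIM)]

# Synthetic activations: "concept present" examples are shifted along axis 0.
concept_axis = [1.0, 0.0, 0.0, 0.0]
pos = [add(randvec(), scale(concept_axis, 2.0)) for _ in range(50)]
neg = [randvec() for _ in range(50)]

# Locate: difference-of-means estimate of the concept direction (unit norm).
diff = [p - q for p, q in zip(mean(pos), mean(neg))]
direction = scale(diff, 1.0 / dot(diff, diff) ** 0.5)

# Steer: shift a hidden activation along the located direction by strength alpha.
def steer(activation, direction, alpha):
    return add(activation, scale(direction, alpha))

h = neg[0]                                   # an activation without the concept
h_steered = steer(h, direction, alpha=3.0)   # pushed toward the concept
print(dot(h, direction), dot(h_steered, direction))
```

Because `direction` is unit-norm, the steered activation's projection onto the concept direction increases by exactly `alpha`; in a real LLM the same additive intervention would be applied to residual-stream activations via a forward hook.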
Related papers
- Focus-LIME: Surgical Interpretation of Long-Context Large Language Models via Proxy-Based Neighborhood Selection [9.796641194900749]
Focus-LIME is a coarse-to-fine framework designed to restore the tractability of surgical interpretation. Our method makes surgical explanations practicable and provides faithful explanations to users.
arXiv Detail & Related papers (2026-02-04T14:34:30Z) - Foundation Model for Skeleton-Based Human Action Understanding [56.89025287217221]
This paper presents a Unified Skeleton-based Dense Representation Learning framework. USDRL consists of a Transformer-based Dense Spatio-Temporal (DSTE) module, Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT).
arXiv Detail & Related papers (2025-08-18T02:42:16Z) - When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs [55.20230501807337]
We present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from the Llama, Qwen, and Gemma families across 52 tasks from the Natural Instructions dataset.
arXiv Detail & Related papers (2025-08-15T10:32:50Z) - InverseScope: Scalable Activation Inversion for Interpreting Large Language Models [5.670123459649656]
InverseScope is an assumption-light and scalable framework for interpreting neural activations via input inversion. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture. We also introduce a quantitative evaluation protocol that tests interpretability hypotheses using the feature consistency rate computed over the sampled inputs.
arXiv Detail & Related papers (2025-06-09T03:59:28Z) - Using Machine Learning to Discover Parsimonious and Physically-Interpretable Representations of Catchment-Scale Rainfall-Runoff Dynamics [0.8594140167290097]
We develop parsimonious, minimally-optimal representations that can facilitate better insight regarding system functioning. We find that both physical interpretability and good predictive performance can be achieved using a distributed-state network. The results indicate that MCP-based ML models with only a few layers (up to two) and relatively few physical flow pathways can play a significant role in ML-based streamflow modelling.
arXiv Detail & Related papers (2024-12-06T08:30:01Z) - STAT: Towards Generalizable Temporal Action Localization [56.634561073746056]
Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action instances with only video-level labels.
Existing methods suffer from severe performance degradation when transferring to different distributions.
We propose GTAL, which focuses on improving the generalizability of action localization methods.
arXiv Detail & Related papers (2024-04-20T07:56:21Z) - A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech.
We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
arXiv Detail & Related papers (2024-04-15T00:03:16Z) - OSM: Leveraging Model Checking for Observing Dynamic Behaviors in Aspect-Oriented Applications [0.0]
The observe-based statistical model-checking (OSM) framework is devised to craft executable formal models directly from foundational system code.
This ensures the unyielding performance of electronic health record systems amidst shifting preconditions.
arXiv Detail & Related papers (2024-03-03T00:03:34Z) - Towards Best Practices of Activation Patching in Language Models: Metrics and Methods [9.121998462494533]
We examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods.
Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred.
arXiv Detail & Related papers (2023-09-27T21:53:56Z) - MACE: An Efficient Model-Agnostic Framework for Counterfactual Explanation [132.77005365032468]
We propose a novel framework of Model-Agnostic Counterfactual Explanation (MACE).
In our MACE approach, we propose a novel RL-based method for finding good counterfactual examples and a gradient-less descent method for improving proximity.
Experiments on public datasets validate the effectiveness of MACE, showing better validity, sparsity, and proximity.
arXiv Detail & Related papers (2022-05-31T04:57:06Z) - Quantifying Explainability in NLP and Analyzing Algorithms for Performance-Explainability Tradeoff [0.0]
We explore the current art of explainability and interpretability within a case study in clinical text classification.
We demonstrate various visualization techniques for fully interpretable methods as well as model-agnostic post hoc attributions.
We introduce a framework through which practitioners and researchers can assess the frontier between a model's predictive performance and the quality of its available explanations.
arXiv Detail & Related papers (2021-07-12T19:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.