Circuit Insights: Towards Interpretability Beyond Activations
- URL: http://arxiv.org/abs/2510.14936v1
- Date: Thu, 16 Oct 2025 17:49:41 GMT
- Title: Circuit Insights: Towards Interpretability Beyond Activations
- Authors: Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, Sebastian Lapuschkin
- Abstract summary: We propose WeightLens and CircuitLens, two complementary methods for mechanistic interpretability. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics.
- Score: 20.178085579725472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection and remain limited to toy tasks. Automated interpretability offers scalability by analyzing isolated features and their activations, but it often misses interactions between features and depends strongly on external LLMs and dataset quality. Transcoders have recently made it possible to separate feature attributions into input-dependent and input-invariant components, providing a foundation for more systematic circuit analysis. Building on this, we propose WeightLens and CircuitLens, two complementary methods that go beyond activation-based analysis. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets while matching or exceeding the performance of existing methods on context-independent features. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics that activation-only approaches cannot identify. Together, these methods increase interpretability robustness and enhance scalable mechanistic analysis of circuits while maintaining efficiency and quality.
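The abstract describes WeightLens as interpreting features directly from their learned weights, with no explainer model or dataset in the loop. As a rough, hypothetical illustration of that general idea (not the paper's actual algorithm), the sketch below scores vocabulary tokens by the alignment between a transcoder feature's decoder direction and the model's unembedding matrix; all names, shapes, and weights are placeholder assumptions.

```python
# Minimal sketch of weight-based feature interpretation (toy placeholders):
# rank vocabulary tokens by how strongly a feature's decoder direction aligns
# with the unembedding matrix, using no activation data at all.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, vocab_size = 64, 256, 1000      # assumed toy dimensions

W_dec = rng.standard_normal((n_features, d_model))   # transcoder decoder weights (placeholder)
W_U = rng.standard_normal((d_model, vocab_size))     # model unembedding matrix (placeholder)

def top_tokens_for_feature(feature_idx: int, k: int = 10) -> np.ndarray:
    """Return indices of the k vocabulary tokens whose unembedding directions
    best align with the feature's decoder vector (a 'logit lens on weights')."""
    logits = W_dec[feature_idx] @ W_U                 # shape: (vocab_size,)
    return np.argsort(-logits)[:k]

print(top_tokens_for_feature(42))
```

In a real setting, W_dec and W_U would come from a trained transcoder and the language model's output embedding, and the returned indices would be decoded with the model's tokenizer; the paper's WeightLens and CircuitLens methods may differ substantially from this simplified view.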
Related papers
- Towards Worst-Case Guarantees with Scale-Aware Interpretability [58.519943565092724]
Neural networks organize information according to the hierarchical, multi-scale structure of natural data. We propose a unifying research agenda -- scale-aware interpretability -- to develop formal machinery and interpretability tools.
arXiv Detail & Related papers (2026-02-05T01:22:31Z) - Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features [11.463277740376236]
Current interpretation methods infer feature semantics from activation patterns, but overlook that features are trained to reconstruct activations that serve computational roles in the forward pass. We introduce a novel weight-based interpretation framework that measures functional effects through direct weight interactions, requiring no activation data.
arXiv Detail & Related papers (2026-01-30T01:30:48Z) - Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders [8.188989044347595]
We propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior.
arXiv Detail & Related papers (2026-01-06T12:40:37Z) - Modeling Transformers as complex networks to analyze learning dynamics [0.2538209532048867]
This project investigates whether learning dynamics can be characterized through the lens of Complex Network Theory. I introduce a novel methodology to represent a Transformer-based model as a directed, weighted graph where nodes are the model's computational components. I analyze a suite of graph-theoretic metrics to reveal that the network's structure evolves through distinct phases of exploration, consolidation, and refinement.
arXiv Detail & Related papers (2025-09-18T10:20:26Z) - Provable In-Context Learning of Nonlinear Regression with Transformers [66.99048542127768]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters. Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks. This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z) - Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning [50.53703102032562]
Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood.
arXiv Detail & Related papers (2025-05-16T08:50:42Z) - Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning [1.597617022056624]
We show how encouraging sparsity and locality in network weights leads to the emergence of functional modules in RL policy networks. Applying these methods to 2D and 3D MiniGrid environments reveals the consistent emergence of distinct navigational modules for different axes.
arXiv Detail & Related papers (2025-01-28T17:02:16Z) - Dynamical Mean-Field Theory of Self-Attention Neural Networks [0.0]
Transformer-based models have demonstrated exceptional performance across diverse domains.
Little is known about how they operate or what their expected dynamics are.
We use methods for the study of asymmetric Hopfield networks in nonequilibrium regimes.
arXiv Detail & Related papers (2024-06-11T13:29:34Z) - The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks [0.0]
Local Interaction Basis aims to identify computational features by removing irrelevant activations and interactions.
We evaluate the effectiveness of LIB on modular addition and CIFAR-10 models.
We conclude that LIB is a promising theory-driven approach for analyzing neural networks, but in its current form is not applicable to large language models.
arXiv Detail & Related papers (2024-05-17T17:27:19Z) - Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
arXiv Detail & Related papers (2024-03-28T17:56:07Z) - Stabilizing Q-learning with Linear Architectures for Provably Efficient
Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z) - Transforming Feature Space to Interpret Machine Learning Models [91.62936410696409]
This contribution proposes a novel approach that interprets machine-learning models through the lens of feature space transformations.
It can be used to enhance unconditional as well as conditional post-hoc diagnostic tools.
A case study on remote-sensing landcover classification with 46 features is used to demonstrate the potential of the proposed approach.
arXiv Detail & Related papers (2021-04-09T10:48:11Z)