Related papers: Concept-Guided Interpretability via Neural Chunking

Concept-Guided Interpretability via Neural Chunking

URL: http://arxiv.org/abs/2505.11576v2
Date: Mon, 25 Aug 2025 22:19:25 GMT
Title: Concept-Guided Interpretability via Neural Chunking
Authors: Shuchen Wu, Stephan Alaniz, Shyamgopal Karthik, Peter Dayan, Eric Schulz, Zeynep Akata,
Abstract summary: We show that neural networks exhibit patterns in their raw population activity that mirror regularities in the training data.<n>We propose three methods to extract recurring chunks on a neural population level.<n>Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data.
Score: 64.6429903327095
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Neural networks are often described as black boxes, reflecting the significant challenge of understanding their internal workings and interactions. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage our cognitive tendency of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract recurring chunks on a neural population level, complementing each other based on label availability and neural data dimensionality. Discrete sequence chunking (DSC) learns a dictionary of entities in a lower-dimensional neural space; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting concept-encoding entities agnostic to model architectures. These concepts can be both concrete (words), abstract (POS tags), or structural (narrative schema). Additionally, we show that extracted chunks play a causal role in network behavior, as grafting them leads to controlled and predictable changes in the model's behavior. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.

Related papers

From superposition to sparse codes: interpretable representations in neural networks [3.6738925004882685]
Recent evidence suggests that neural networks encode features in superposition, meaning that input concepts are linearly overlaid within the network's representations.<n>We present a perspective that explains this phenomenon and provides a foundation for extracting interpretable representations from neural activations.<n>Our arguments have implications for neural coding theories, AI transparency, and the broader goal of making deep learning models more interpretable.
arXiv Detail & Related papers (2025-03-03T18:49:59Z)
Discovering Chunks in Neural Embeddings for Interpretability [53.80157905839065]
We propose leveraging the principle of chunking to interpret artificial neural population activities.<n>We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities.<n>We identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts.
arXiv Detail & Related papers (2025-02-03T20:30:46Z)
Coding schemes in neural networks learning classification tasks [52.22978725954347]
We investigate fully-connected, wide neural networks learning classification tasks. We show that the networks acquire strong, data-dependent features. Surprisingly, the nature of the internal representations depends crucially on the neuronal nonlinearity.
arXiv Detail & Related papers (2024-06-24T14:50:05Z)
Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning. Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z)
Unsupervised representation learning with Hebbian synaptic and structural plasticity in brain-like feedforward neural networks [0.0]
We introduce and evaluate a brain-like neural network model capable of unsupervised representation learning.<n>The model was tested on a diverse set of popular machine learning benchmarks.
arXiv Detail & Related papers (2024-06-07T08:32:30Z)
Spiking representation learning for associative memories [0.0]
We introduce a novel artificial spiking neural network (SNN) that performs unsupervised representation learning and associative memory operations.<n>The architecture of our model derives from the neocortical columnar organization and combines feedforward projections for learning hidden representations and recurrent projections for forming associative memories.
arXiv Detail & Related papers (2024-06-05T08:30:11Z)
Automatic Discovery of Visual Circuits [66.99553804855931]
We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
arXiv Detail & Related papers (2024-04-22T17:00:57Z)
Manipulating Feature Visualizations with Gradient Slingshots [54.31109240020007]
We introduce a novel method for manipulating Feature Visualization (FV) without significantly impacting the model's decision-making process. We evaluate the effectiveness of our method on several neural network models and demonstrate its capabilities to hide the functionality of arbitrarily chosen neurons.
arXiv Detail & Related papers (2024-01-11T18:57:17Z)
Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation [0.0]
Out-of-distribution generalization in neural networks is often hampered by spurious correlations. Existing concept-removal methods tend to be overzealous by inadvertently eliminating features associated with the main task of the model. We propose an iterative algorithm that separates spurious from main-task concepts by jointly identifying two low-dimensional subspaces in the neural network representation.
arXiv Detail & Related papers (2023-10-18T14:22:36Z)
Interpretable part-whole hierarchies and conceptual-semantic relationships in neural networks [4.153804257347222]
We present Agglomerator, a framework capable of providing a representation of part-whole hierarchies from visual cues. We evaluate our method on common datasets, such as SmallNORB, MNIST, FashionMNIST, CIFAR-10, and CIFAR-100.
arXiv Detail & Related papers (2022-03-07T10:56:13Z)
Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs. By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules. inputs to the model are routed through a sequence of functions in a way that is end-to-end learned. We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferrable to a new task in a sample efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.