Related papers: Improving Neuron-level Interpretability with White-box Language Models

URL: http://arxiv.org/abs/2410.16443v1
Date: Mon, 21 Oct 2024 19:12:33 GMT
Title: Improving Neuron-level Interpretability with White-box Language Models
Authors: Hao Bai, Yi Ma
Abstract summary: We introduce a white-box transformer-like architecture named Coding RAte TransformEr (CRATE). Our comprehensive experiments showcase significant improvements (up to 103% relative improvement) in neuron-level interpretability. CRATE's increased interpretability comes from its enhanced ability to consistently and distinctively activate on relevant tokens.
Score: 11.898535906016907
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Neurons in auto-regressive language models like GPT-2 can be interpreted by analyzing their activation patterns. Recent studies have shown that techniques such as dictionary learning, a form of post-hoc sparse coding, enhance this neuron-level interpretability. In our research, we are driven by the goal to fundamentally improve neural network interpretability by embedding sparse coding directly within the model architecture, rather than applying it as an afterthought. In our study, we introduce a white-box transformer-like architecture named Coding RAte TransformEr (CRATE), explicitly engineered to capture sparse, low-dimensional structures within data distributions. Our comprehensive experiments showcase significant improvements (up to 103% relative improvement) in neuron-level interpretability across a variety of evaluation metrics. Detailed investigations confirm that this enhanced interpretability is steady across different layers irrespective of the model size, underlining CRATE's robust performance in enhancing neural network interpretability. Further analysis shows that CRATE's increased interpretability comes from its enhanced ability to consistently and distinctively activate on relevant tokens. These findings point towards a promising direction for creating white-box foundation models that excel in neuron-level interpretation.

Related papers

Sparse Autoencoder Neural Operators: Model Recovery in Function Spaces [75.45093712182624]
We introduce a framework that extends sparse autoencoders (SAEs) to lifted spaces and infinite-dimensional function spaces, enabling mechanistic interpretability of large neural operators (NO). We compare the inference and training dynamics of SAEs, lifted-SAE, and SAE neural operators. We highlight how lifting and operator modules introduce beneficial inductive biases, enabling faster recovery, improved recovery of smooth concepts, and robust inference across varying resolutions, a property unique to neural operators.
arXiv Detail & Related papers (2025-09-03T21:57:03Z)
Probing Neural Topology of Large Language Models [15.34202977968525]
We introduce graph probing, a method for uncovering the functional connectivity topology of LLM neurons. We find a universal predictability of next-token prediction performance using only neural topology. This predictability is robust even when retaining just 1% of neuron connections or probing models after only 8 pretraining steps.
arXiv Detail & Related papers (2025-06-01T14:57:03Z)
Concept-Guided Interpretability via Neural Chunking [64.6429903327095]
We show that neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We propose three methods to extract recurring chunks on a neural population level. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data.
arXiv Detail & Related papers (2025-05-16T13:49:43Z)
Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT [0.0]
This study investigates the internal representations of verb-particle combinations within large language models (LLMs). We analyse the representational efficacy of its layers for various verb-particle constructions such as 'agree on', 'come back', and 'give up'. Results show that BERT's middle layers most effectively capture syntactic structures, with significant variability in representational accuracy across different verb categories.
arXiv Detail & Related papers (2024-12-19T09:21:39Z)
Language Model Meets Prototypes: Towards Interpretable Text Classification Models through Prototypical Networks [1.1711824752079485]
This dissertation focuses on developing intrinsically interpretable models when using LMs as encoders. I developed a novel white-box multi-head graph attention-based prototype network. I am working on extending the attention-based prototype network with contrastive learning to redesign an interpretable graph neural network.
arXiv Detail & Related papers (2024-12-04T22:59:35Z)
Interpretable Language Modeling via Induction-head Ngram Models [74.26720927767398]
We propose Induction-head ngram models (Induction-Gram) to bolster modern ngram models with a hand-engineered "induction head". This induction head uses a custom neural similarity metric to efficiently search the model's input context for potential next-word completions. Experiments show that this simple method significantly improves next-word prediction over baseline interpretable models.
arXiv Detail & Related papers (2024-10-31T12:33:26Z)
Cognitive Networks and Performance Drive fMRI-Based State Classification Using DNN Models [0.0]
We employ two structurally different and complementary DNN-based models to classify individual cognitive states. We show that despite the architectural differences, both models consistently produce a robust relationship between prediction accuracy and individual cognitive performance.
arXiv Detail & Related papers (2024-08-14T15:25:51Z)
Improving Network Interpretability via Explanation Consistency Evaluation [56.14036428778861]
We propose a framework that acquires more explainable activation heatmaps and simultaneously increases the model performance. Specifically, our framework introduces a new metric, i.e., explanation consistency, to reweight the training samples adaptively in model learning. Our framework then promotes the model learning by paying closer attention to those training samples with a high difference in explanations.
arXiv Detail & Related papers (2024-08-08T17:20:08Z)
Efficient and Flexible Neural Network Training through Layer-wise Feedback Propagation [49.44309457870649]
Layer-wise Feedback Propagation (LFP) is a novel training principle for neural network-like predictors. LFP decomposes a reward to individual neurons based on their respective contributions. Our method then implements a greedy approach, reinforcing helpful parts of the network and weakening harmful ones.
arXiv Detail & Related papers (2023-08-23T10:48:28Z)
Scale Alone Does not Improve Mechanistic Interpretability in Vision Models [16.020535763297175]
Machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. We quantify one form of mechanistic interpretability for a diverse suite of nine models. None of the investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago.
arXiv Detail & Related papers (2023-07-11T17:56:22Z)
Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean [1.0923877073891446]
Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks. Despite this success, the inner workings of DNNs are often not transparent. This lack of interpretability has led to increased research on inherently interpretable neural networks.
arXiv Detail & Related papers (2023-01-27T17:06:13Z)
Seeking Interpretability and Explainability in Binary Activated Neural Networks [2.828173677501078]
We study the use of binary activated neural networks as interpretable and explainable predictors in the context of regression tasks. We present an approach based on the efficient computation of SHAP values for quantifying the relative importance of the features, hidden neurons and even weights.
arXiv Detail & Related papers (2022-09-07T20:11:17Z)
Functional Network: A Novel Framework for Interpretability of Deep Neural Networks [2.641939670320645]
We propose a novel framework for interpretability of deep neural networks, that is, the functional network. In our experiments, the mechanisms of regularization methods, namely, batch normalization and dropout, are revealed.
arXiv Detail & Related papers (2022-05-24T01:17:36Z)
Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs. By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules. Inputs to the model are routed through a sequence of functions in a way that is end-to-end learned. We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z)
Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning. Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning [109.84770951839289]
We present PredRNN, a new recurrent network for learning visual dynamics from historical context. We show that our approach obtains highly competitive results on three standard datasets.
arXiv Detail & Related papers (2021-03-17T08:28:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.