SAFR: Neuron Redistribution for Interpretability
- URL: http://arxiv.org/abs/2501.16374v2
- Date: Tue, 11 Feb 2025 00:26:45 GMT
- Title: SAFR: Neuron Redistribution for Interpretability
- Authors: Ruidi Chang, Chunyuan Deng, Hanjie Chen
- Abstract summary: Superposition refers to encoding representations of multiple features within a single neuron. Despite the promising performance it enables, superposition diminishes model interpretability. This paper presents a novel approach to enhance model interpretability by regularizing feature superposition.
- Score: 7.756342860929851
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Superposition refers to encoding representations of multiple features within a single neuron, which is common in deep neural networks. This property allows neurons to combine and represent multiple features, enabling the model to capture intricate information and handle complex tasks. Despite the promising performance it enables, superposition diminishes model interpretability. This paper presents a novel approach to enhance model interpretability by regularizing feature superposition. We introduce SAFR, which applies regularization terms to the loss function to promote monosemantic representations for important tokens while encouraging polysemanticity for correlated token pairs; important tokens and correlated token pairs are identified via VMASK and attention weights, respectively. We evaluate SAFR with a transformer model on two classification tasks. Experiments demonstrate the effectiveness of SAFR in improving model interpretability without compromising prediction performance. In addition, SAFR provides explanations by visualizing the neuron allocation within the intermediate layers.
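The abstract does not spell out the exact form of the regularizers, but the idea can be sketched as two penalty terms added to the task loss. Everything below (the function name, the L1-sparsity proxy for monosemanticity, the cosine-overlap reward for polysemanticity, and the lambda weights) is our illustrative assumption, not the paper's formulation:

```python
import torch
import torch.nn.functional as F

def safr_penalty(hidden, important_idx, pair_idx, lam_mono=0.1, lam_poly=0.1):
    """Hypothetical SAFR-style regularizer (our reading of the abstract).

    hidden:        (seq_len, n_neurons) intermediate activations, one example
    important_idx: token positions flagged as important (e.g. by VMASK)
    pair_idx:      (k, 2) token-position pairs judged correlated
                   (e.g. from attention weights)
    """
    # Monosemanticity: L1 on L2-normalized activations is smallest when a
    # token's activation mass sits on a single neuron.
    imp = F.normalize(hidden[important_idx], dim=-1)
    mono = imp.abs().sum(dim=-1).mean()

    # Polysemanticity: reward correlated pairs for activating the same
    # neurons, via cosine similarity of their activation patterns.
    a = F.normalize(hidden[pair_idx[:, 0]], dim=-1)
    b = F.normalize(hidden[pair_idx[:, 1]], dim=-1)
    poly = -(a * b).sum(dim=-1).mean()

    return lam_mono * mono + lam_poly * poly

# Training step (sketch):
# loss = F.cross_entropy(logits, y) + safr_penalty(h, imp_idx, pairs)
```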
Related papers
- Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization [17.101290138120564]
Current methods rely on dictionary learning with sparse autoencoders (SAEs). Here, we tackle these limitations by directly decomposing activations with semi-nonnegative matrix factorization (SNMF). Experiments on Llama 3.1, Gemma 2, and GPT-2 show that SNMF-derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering.
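As a concrete illustration of the decomposition, here is a minimal semi-NMF on an activation matrix using multiplicative updates in the style of Ding et al.; the factorization X ≈ F Gᵀ (G ≥ 0), the update rules, and all sizes are generic assumptions rather than the paper's exact recipe:

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Semi-nonnegative matrix factorization, X ~ F @ G.T (illustrative).

    X: (d, n) real-valued activation matrix; F: (d, k) mixed-sign basis;
    G: (n, k) nonnegative coefficients. Applied to MLP activations, columns
    of F act as candidate features and rows of G say how strongly each
    input activates them.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    G = np.abs(rng.standard_normal((n, k)))
    pos = lambda M: (np.abs(M) + M) / 2
    neg = lambda M: (np.abs(M) - M) / 2
    for _ in range(n_iter):
        F = X @ G @ np.linalg.pinv(G.T @ G)       # least-squares basis update
        XtF, FtF = X.T @ F, F.T @ F
        G *= np.sqrt((pos(XtF) + G @ neg(FtF)) /  # multiplicative update
                     (neg(XtF) + G @ pos(FtF) + eps))  # keeps G >= 0
    return F, G
```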
arXiv Detail & Related papers (2025-06-12T17:33:29Z)
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z)
- Learning local discrete features in explainable-by-design convolutional neural networks [0.0]
We introduce an explainable-by-design convolutional neural network (CNN) based on the lateral inhibition mechanism.
The model consists of a predictor: a high-accuracy CNN with residual or dense skip connections.
By collecting observations and directly calculating probabilities, we can explain causal relationships between motifs of adjacent levels.
arXiv Detail & Related papers (2024-10-31T18:39:41Z)
- Improving Neuron-level Interpretability with White-box Language Models [11.898535906016907]
We introduce a white-box transformer-like architecture named Coding RAte TransformEr (CRATE). Our comprehensive experiments showcase significant improvements (up to 103% relative improvement) in neuron-level interpretability. CRATE's increased interpretability comes from its enhanced ability to consistently and distinctively activate on relevant tokens.
arXiv Detail & Related papers (2024-10-21T19:12:33Z)
- PseudoNeg-MAE: Self-Supervised Point Cloud Learning using Conditional Pseudo-Negative Embeddings [55.55445978692678]
PseudoNeg-MAE is a self-supervised learning framework that enhances the global feature representation of point cloud masked autoencoders.
We show that PseudoNeg-MAE achieves state-of-the-art performance on the ModelNet40 and ScanObjectNN datasets.
arXiv Detail & Related papers (2024-09-24T07:57:21Z)
- Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation [52.270712965271656]
We propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective.
We find that the graph of our model resembles transformers, with correspondences between dependencies and self-attention.
Experiments show that our model performs competitively to transformers on small to medium sized datasets.
arXiv Detail & Related papers (2023-11-26T06:56:02Z)
- NPEFF: Non-Negative Per-Example Fisher Factorization [52.44573961263344]
We introduce a novel interpretability method called NPEFF that is readily applicable to any end-to-end differentiable model.
We demonstrate that NPEFF has interpretable tunings through experiments on language and vision models.
arXiv Detail & Related papers (2023-10-07T02:02:45Z)
- Interpretable Sentence Representation with Variational Autoencoders and Attention [0.685316573653194]
We develop methods to enhance the interpretability of recent representation learning techniques in natural language processing (NLP).
We leverage Variational Autoencoders (VAEs) due to their efficiency in relating observations to latent generative factors.
We build two models with inductive bias to separate information in latent representations into understandable concepts without annotated data.
arXiv Detail & Related papers (2023-05-04T13:16:15Z)
- Learning Disentangled Semantic Spaces of Explanations via Invertible Neural Networks [10.880057430629126]
Disentangled latent spaces usually have better semantic separability and geometrical properties, which leads to better interpretability and more controllable data generation.
In this work, we focus on a more general form of sentence disentanglement, targeting the localised modification and control of more general sentence semantic features.
We introduce a flow-based invertible neural network (INN) mechanism integrated with a transformer-based language autoencoder (AE) in order to deliver latent spaces with better separability properties.
arXiv Detail & Related papers (2023-05-02T18:27:13Z)
- Modeling Implicit Bias with Fuzzy Cognitive Maps [0.0]
This paper presents a Fuzzy Cognitive Map model to quantify implicit bias in structured datasets.
We introduce a new reasoning mechanism equipped with a normalization-like transfer function that prevents neurons from saturating.
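The summary does not give the transfer function, so the following is only a sketch of one FCM reasoning step with a hypothetical normalization-like squashing that keeps activations away from saturated extremes; the function name and update details are our assumptions:

```python
import numpy as np

def fcm_step(a, W, eps=1e-8):
    """One fuzzy-cognitive-map reasoning step (illustrative sketch).

    a: (n,) current concept activations
    W: (n, n) signed causal weights, W[i, j] = influence of concept i on j
    Instead of a sigmoid, which can saturate, raw inputs are rescaled by
    their maximum magnitude, keeping activations in [-1, 1].
    """
    raw = a @ W
    return raw / (np.abs(raw).max() + eps)

# Iterate the map toward a fixed point.
a = np.array([0.5, 0.1, 0.0, 0.8])
W = np.random.uniform(-1, 1, (4, 4))
for _ in range(50):
    a = fcm_step(a, W)
```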
arXiv Detail & Related papers (2021-12-23T17:04:12Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
- FF-NSL: Feed-Forward Neural-Symbolic Learner [70.978007919101]
This paper introduces a neural-symbolic learning framework called Feed-Forward Neural-Symbolic Learner (FF-NSL).
FF-NSL integrates state-of-the-art ILP systems based on Answer Set semantics with neural networks in order to learn interpretable hypotheses from labelled unstructured data.
arXiv Detail & Related papers (2021-06-24T15:38:34Z)
- It's FLAN time! Summing feature-wise latent representations for interpretability [0.0]
We propose a novel class of structurally-constrained neural networks, which we call FLANs (Feature-wise Latent Additive Networks).
FLANs process each input feature separately, computing for each of them a representation in a common latent space.
These feature-wise latent representations are then simply summed, and the aggregated representation is used for prediction.
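The mechanism is simple enough to sketch directly; layer sizes and names below are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class FLAN(nn.Module):
    """Minimal sketch of a Feature-wise Latent Additive Network.

    Each input feature is mapped into a shared latent space by its own
    small network; the per-feature latents are summed and the aggregate is
    decoded into a prediction, so each feature's contribution is additive
    and inspectable.
    """
    def __init__(self, n_features, latent_dim=16, n_classes=2):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, latent_dim))
            for _ in range(n_features)
        )
        self.head = nn.Linear(latent_dim, n_classes)

    def forward(self, x):  # x: (batch, n_features)
        latents = [enc(x[:, i:i + 1]) for i, enc in enumerate(self.encoders)]
        return self.head(torch.stack(latents).sum(dim=0))
```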
arXiv Detail & Related papers (2021-06-18T12:19:33Z)
- And/or trade-off in artificial neurons: impact on adversarial robustness [91.3755431537592]
The presence of a sufficient number of OR-like neurons in a network can lead to classification brittleness and increased vulnerability to adversarial attacks.
We define AND-like neurons and propose measures to increase their proportion in the network.
Experimental results on the MNIST dataset suggest that our approach holds promise as a direction for further exploration.
arXiv Detail & Related papers (2021-02-15T08:19:05Z)
- Explaining and Improving Model Behavior with k Nearest Neighbor Representations [107.24850861390196]
We propose using k nearest neighbor representations to identify training examples responsible for a model's predictions.
We show that kNN representations are effective at uncovering learned spurious associations.
Our results indicate that the kNN approach makes the finetuned model more robust to adversarial inputs.
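A minimal version of this lookup, assuming hidden representations have already been extracted from the model (the random stand-in arrays and the cosine metric are our choices):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-ins for hidden representations; in practice these come from a
# chosen layer of the fine-tuned model.
train_reprs = np.random.randn(1000, 768)
test_reprs = np.random.randn(8, 768)

# Index the training representations, then retrieve, for each prediction
# we want to explain, the training examples closest in representation space.
index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(train_reprs)
dists, neighbor_ids = index.kneighbors(test_reprs)
# neighbor_ids[i] lists the training examples most similar to test example
# i; inspecting their text and labels can surface spurious associations.
```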
arXiv Detail & Related papers (2020-10-18T16:55:25Z)
- Understanding Neural Abstractive Summarization Models via Uncertainty [54.37665950633147]
Seq2seq abstractive summarization models generate text in a free-form manner.
We study the entropy, or uncertainty, of the model's token-level predictions.
We show that uncertainty is a useful perspective for analyzing summarization and text generation models more broadly.
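The quantity being analyzed is just the entropy of each token-level softmax distribution over the vocabulary; a minimal sketch:

```python
import torch

def token_entropy(logits):
    """Entropy (in nats) of each token-level prediction distribution.

    logits: (seq_len, vocab_size) decoder outputs for one generated summary.
    """
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)

# High-entropy positions mark uncertain decisions, e.g. novel generation
# rather than copying from the source document.
entropies = token_entropy(torch.randn(20, 50257))
```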
arXiv Detail & Related papers (2020-10-15T16:57:27Z)
- RatE: Relation-Adaptive Translating Embedding for Knowledge Graph Completion [51.64061146389754]
We propose a relation-adaptive translation function built upon a novel weighted product in complex space.
We then present our Relation-adaptive translating Embedding (RatE) approach to score each graph triple.
arXiv Detail & Related papers (2020-10-10T01:30:30Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
With a strong auto-regressive decoder, VAEs tend to ignore latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
- GAMI-Net: An Explainable Neural Network based on Generalized Additive Models with Structured Interactions [5.8010446129208155]
An explainable neural network based on generalized additive models with structured interactions (GAMI-Net) is proposed to pursue a good balance between prediction accuracy and model interpretability.
GAMI-Net is a disentangled feedforward network with multiple additive subnetworks.
Numerical experiments on both synthetic functions and real-world datasets show that the proposed model enjoys superior interpretability.
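The additive structure can be sketched as one subnetwork per feature plus one per selected feature pair, all summed into the prediction; sizes and the pair list below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GAMINetSketch(nn.Module):
    """Illustrative GAMI-Net-style additive model.

    One subnetwork per feature captures main effects; one subnetwork per
    selected feature pair captures structured pairwise interactions. The
    output is the sum of all subnetwork outputs, so each effect can be
    plotted and inspected on its own.
    """
    def __init__(self, n_features, pairs):
        super().__init__()
        mlp = lambda d: nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1))
        self.mains = nn.ModuleList(mlp(1) for _ in range(n_features))
        self.pairs = pairs  # e.g. [(0, 2), (1, 3)]
        self.interactions = nn.ModuleList(mlp(2) for _ in pairs)
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x: (batch, n_features)
        out = self.bias + sum(f(x[:, i:i + 1]) for i, f in enumerate(self.mains))
        out = out + sum(g(x[:, list(p)]) for p, g in zip(self.pairs, self.interactions))
        return out
```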
arXiv Detail & Related papers (2020-03-16T11:51:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.