Minimalist Explanation Generation and Circuit Discovery
- URL: http://arxiv.org/abs/2509.25686v1
- Date: Tue, 30 Sep 2025 02:43:44 GMT
- Title: Minimalist Explanation Generation and Circuit Discovery
- Authors: Pirzada Suhail, Aditya Anand, Amit Sethi
- Abstract summary: In this paper, we introduce an activation-matching based approach to generate minimal explanations for machine learning decisions. We train a lightweight autoencoder to produce binary masks that learn to highlight the decision-wise critical regions of an image. The minimal explanations so generated also allow us to mechanistically interpret the model internals.
- Score: 10.850989126934317
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning models, by virtue of training, learn a large repertoire of decision rules for any given input, and any one of these may suffice to justify a prediction. However, in high-dimensional input spaces, such rules are difficult to identify and interpret. In this paper, we introduce an activation-matching based approach to generate minimal and faithful explanations for the decisions of pre-trained image classifiers. We aim to identify minimal explanations that not only preserve the model's decision but are also concise and human-readable. To achieve this, we train a lightweight autoencoder to produce binary masks that learn to highlight the decision-wise critical regions of an image while discarding irrelevant background. The training objective integrates activation alignment across multiple layers, consistency at the output label, priors that encourage sparsity and compactness, and a robustness constraint that enforces faithfulness. The minimal explanations so generated also allow us to mechanistically interpret the model internals. To this end, we introduce a circuit readout procedure: using the explanation's forward pass and gradients, we identify active channels and construct a channel-level graph, scoring inter-layer edges by ingress weight magnitude times source activation, and feature-to-class links by classifier weight magnitude times feature activation. Together, these contributions provide a practical bridge between minimal input-level explanations and a mechanistic understanding of the internal computations driving model decisions.
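To make the training objective concrete, here is a minimal PyTorch sketch of a composite loss of this shape, assuming a classifier with accessible intermediate activations. All names (`mask_net`, `get_activations`, the `w_*` weights) are illustrative assumptions, not the authors' code, and the abstract's robustness constraint is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def explanation_loss(x, mask_net, classifier, get_activations,
                     w_act=1.0, w_ce=1.0, w_sparse=0.1, w_tv=0.1):
    """Sketch of the composite objective: activation matching +
    label consistency + sparsity and compactness priors."""
    m = mask_net(x)                           # soft mask in [0, 1], same spatial size as x
    e = m * x                                 # explanation e = m ⊙ x

    acts_x = get_activations(classifier, x)   # assumed helper: list of per-layer activations
    acts_e = get_activations(classifier, e)   # (same layers, same order)

    # (i) multi-layer activation matching: KL divergence between the
    # softmax-normalized activations of the image and of the explanation
    act_loss = sum(
        F.kl_div(F.log_softmax(a_e.flatten(1), dim=1),
                 F.softmax(a_x.flatten(1), dim=1).detach(),
                 reduction="batchmean")
        for a_e, a_x in zip(acts_e, acts_x))

    # (ii) label consistency: cross-entropy keeps the model's top-1 label
    target = classifier(x).argmax(dim=1).detach()
    ce_loss = F.cross_entropy(classifier(e), target)

    # (iii) sparsity prior: penalize total mask mass
    sparse_loss = m.mean()

    # (iv) compactness prior: total variation keeps the mask spatially coherent
    tv_loss = ((m[..., 1:, :] - m[..., :-1, :]).abs().mean()
               + (m[..., :, 1:] - m[..., :, :-1]).abs().mean())

    return w_act * act_loss + w_ce * ce_loss + w_sparse * sparse_loss + w_tv * tv_loss
```

And a minimal sketch of the channel-level circuit readout, assuming a plain sequential CNN (each conv layer's input channels match the previous layer's output channels) followed by global average pooling and a linear head. The paper also uses gradients to identify active channels; this sketch simplifies that to keeping the top-k highest-scoring edges:

```python
import torch

@torch.no_grad()
def circuit_readout(acts, conv_layers, classifier_head, label, top_k=5):
    """Build a channel-level graph from the explanation's forward pass.

    acts[i]: activation tensor (C_i, H_i, W_i) of conv_layers[i] on the explanation
    conv_layers[i].weight: conv kernel of shape (C_out, C_in, kH, kW)
    classifier_head: nn.Linear over globally pooled features
    Returns a list of (source_node, target_node, score) edges.
    """
    edges = []
    # Inter-layer edges: |ingress weight| times mean source-channel activation.
    for i in range(1, len(conv_layers)):
        w = conv_layers[i].weight.abs().sum(dim=(2, 3))   # (C_out, C_in)
        src = acts[i - 1].mean(dim=(1, 2))                # (C_in,) mean activations
        score = w * src                                   # broadcast over C_out rows
        for c_out in range(score.shape[0]):
            vals, idx = score[c_out].topk(min(top_k, score.shape[1]))
            edges += [((i - 1, c_in.item()), (i, c_out), v.item())
                      for v, c_in in zip(vals, idx)]
    # Feature-to-class links: |classifier weight| times pooled feature activation.
    feats = acts[-1].mean(dim=(1, 2))                     # (C_last,) pooled features
    cls_score = classifier_head.weight[label].abs() * feats
    vals, idx = cls_score.topk(min(top_k, cls_score.numel()))
    edges += [((len(conv_layers) - 1, c.item()), ("class", label), v.item())
              for v, c in zip(vals, idx)]
    return edges
```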
Related papers
- Activation Matching for Explanation Generation [10.850989126934317]
We generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. We train a lightweight autoencoder to output a binary mask m such that the explanation e = m ⊙ x preserves both the model's prediction and the intermediate activations of x. Our objective combines multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation.
arXiv Detail & Related papers (2025-09-27T02:12:09Z) - SIDE: Sparse Information Disentanglement for Explainable Artificial Intelligence [9.975642488603937]
Prototypical-parts-based neural networks have emerged as a promising solution by offering concept-level explanations. We introduce Sparse Information Disentanglement for Explainability (SIDE), a novel method that improves the interpretability of prototypical parts.
arXiv Detail & Related papers (2025-07-25T14:34:15Z) - Bidirectional Logits Tree: Pursuing Granularity Reconcilement in Fine-Grained Classification [89.20477310885731]
This paper addresses the challenge of Granularity Competition in fine-grained classification tasks. Existing approaches typically develop independent hierarchy-aware models based on shared features extracted from a common base encoder. We propose a novel framework called the Bidirectional Logits Tree (BiLT) for Granularity Reconcilement.
arXiv Detail & Related papers (2024-12-17T10:42:19Z) - Network Inversion and Its Applications [9.124933643129538]
Neural networks have emerged as powerful tools across various applications, yet their decision-making process often remains opaque, leading to them being perceived as "black boxes". Network inversion techniques offer a solution by allowing us to peek inside these black boxes, revealing the features and patterns learned by the networks behind their decision-making processes. This paper presents a simple yet effective approach to network inversion using a meticulously conditioned generator that learns the data distribution in the input space of the trained neural network.
arXiv Detail & Related papers (2024-11-26T10:04:52Z) - XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification. XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations. Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z) - On the Interpretability of Attention Networks [1.299941371793082]
We show that an attention model can be accurate yet fail to be interpretable, and that such models do occur as a result of training.
We evaluate a few attention model learning algorithms designed to encourage sparsity and demonstrate that these algorithms help improve interpretability.
arXiv Detail & Related papers (2022-12-30T15:31:22Z) - When less is more: Simplifying inputs aids neural network understanding [12.73748893809092]
In this work, we measure simplicity with the encoding bit size given by a pretrained generative model.
We investigate the effect of such simplification in several scenarios: conventional training, dataset condensation and post-hoc explanations.
arXiv Detail & Related papers (2022-01-14T18:58:36Z) - LoCo: Local Contrastive Representation Learning [93.98029899866866]
We show that by overlapping local blocks stacked on top of each other, we effectively increase the decoder depth and allow upper blocks to implicitly send feedback to lower blocks.
This simple design closes the performance gap between local learning and end-to-end contrastive learning algorithms for the first time.
arXiv Detail & Related papers (2020-08-04T05:41:29Z) - A Trainable Optimal Transport Embedding for Feature Aggregation and its
Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z) - Forgetting Outside the Box: Scrubbing Deep Networks of Information
Accessible from Input-Output Observations [143.3053365553897]
We describe a procedure for removing dependency on a cohort of training data from a trained deep network.
We introduce a new bound on how much information can be extracted per query about the forgotten cohort.
We exploit the connections between the activation and weight dynamics of a DNN inspired by Neural Tangent Kernels to compute the information in the activations.
arXiv Detail & Related papers (2020-03-05T23:17:35Z)