Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns
- URL: http://arxiv.org/abs/2601.14112v2
- Date: Wed, 21 Jan 2026 14:35:20 GMT
- Title: Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns
- Authors: George Mihaila
- Abstract summary: We introduce Explanation Network (ExpNet), a lightweight neural network that learns an explicit mapping from transformer attention patterns to token-level importance scores. We evaluate ExpNet in a challenging cross-task setting and benchmark it against a broad spectrum of model-agnostic methods and attention-based techniques.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Explainable AI (XAI) has become critical as transformer-based models are deployed in high-stakes applications including healthcare, legal systems, and financial services, where opacity hinders trust and accountability. Transformers' self-attention mechanisms have proven valuable for model interpretability, with attention weights successfully used to understand model focus and behavior (Xu et al., 2015; Wiegreffe and Pinter, 2019). However, existing attention-based explanation methods rely on manually defined aggregation strategies and fixed attribution rules (Abnar and Zuidema, 2020a; Chefer et al., 2021), while model-agnostic approaches (LIME, SHAP) treat the model as a black box and incur significant computational costs through input perturbation. We introduce Explanation Network (ExpNet), a lightweight neural network that learns an explicit mapping from transformer attention patterns to token-level importance scores. Unlike prior methods, ExpNet discovers optimal attention feature combinations automatically rather than relying on predetermined rules. We evaluate ExpNet in a challenging cross-task setting and benchmark it against a broad spectrum of model-agnostic methods and attention-based techniques spanning four methodological families.
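The abstract does not spell out ExpNet's architecture, so the sketch below is only one plausible reading of "a lightweight network mapping attention patterns to token importance scores": per-token attention features pooled across layers and heads, fed to a small MLP. The pooling, layer sizes, and training signal are all assumptions.

```python
# Hypothetical sketch only: pooling, sizes, and supervision are assumptions.
import torch
import torch.nn as nn

class ExpNetSketch(nn.Module):
    """Maps per-token attention features to token importance scores."""

    def __init__(self, num_layers: int, num_heads: int, hidden: int = 64):
        super().__init__()
        # One feature per (layer, head): attention mass a token receives.
        self.mlp = nn.Sequential(
            nn.Linear(num_layers * num_heads, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (layers, heads, seq, seq) attention weights for one input.
        feats = attn.mean(dim=2)                   # mean over queries -> (L, H, S)
        feats = feats.permute(2, 0, 1).flatten(1)  # (S, L*H)
        return self.mlp(feats).squeeze(-1)         # (S,) token scores

# Training would regress these scores onto a supervision signal, e.g.
# attributions computed by LIME/SHAP on a source task, then reuse the
# learned mapping on new tasks (the paper's cross-task setting).
```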
Related papers
- From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior.
We demonstrate how uncertainty is leveraged as an active control signal across three frontiers.
This survey argues that mastering this new role of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
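As one purely illustrative instance of uncertainty acting as a control signal, the snippet below gates model behavior on the predictive entropy of its next-token distribution; the threshold and the defer-on-high-entropy policy are assumptions, not the survey's.

```python
# Illustrative only: predictive entropy as an "active" uncertainty signal.
import torch

def should_defer(next_token_logits: torch.Tensor, threshold: float = 2.0) -> bool:
    """True if the model is uncertain enough to abstain, retrieve, or escalate."""
    probs = torch.softmax(next_token_logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
    return bool(entropy > threshold)
```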
arXiv Detail & Related papers (2026-01-22T06:21:31Z)
- SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks [0.0]
We present SALVE, a framework that bridges mechanistic interpretability and model editing.
We learn a sparse, model-native feature basis without supervision.
We validate these features with Grad-FAM, a feature-level saliency mapping method.
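SALVE's exact recipe is not given in this summary; below is a minimal sketch of the two ingredients it names: an unsupervised sparse autoencoder that learns a model-native feature basis, and editing along a learned feature direction. Dimensions, the L1 weight, and the steering rule are assumptions.

```python
# Minimal sparse-autoencoder sketch in the spirit of SALVE.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.enc(x))        # sparse, model-native features
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1: float = 1e-3):
    # Reconstruction error plus an L1 penalty that enforces sparsity.
    return ((x - x_hat) ** 2).mean() + l1 * f.abs().mean()

# Editing sketch: steer an activation along learned feature k.
# x_edited = x + alpha * sae.dec.weight[:, k]   # alpha, k chosen by hand
```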
arXiv Detail & Related papers (2025-12-17T20:06:03Z)
- Pushing the Boundaries of Interpretability: Incremental Enhancements to the Explainable Boosting Machine [1.2461503242570642]
This paper aims to improve the Explainable Boosting Machine (EBM), a state-of-the-art glassbox model that delivers both high accuracy and complete transparency.
The work is positioned as a critical step toward developing machine learning systems that are robust, equitable, and transparent.
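The paper's incremental enhancements are not reproduced here; for orientation, this is the baseline EBM from the open-source `interpret` package, which trains one transparent shape function per feature.

```python
# Baseline EBM from the `interpret` package (not the paper's enhancements).
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)                        # learns one shape function per feature
explanation = ebm.explain_global()   # inspectable per-feature curves
```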
arXiv Detail & Related papers (2025-11-29T15:46:13Z)
- Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction [57.19302613163439]
We introduce neural network reprogrammability as a unifying framework for model adaptation.
We present a taxonomy that categorizes such information manipulation approaches across four key dimensions.
We also analyze remaining technical challenges and ethical considerations.
arXiv Detail & Related papers (2025-06-05T05:42:27Z)
- LoRA-Ensemble: Efficient Uncertainty Modelling for Self-Attention Networks [52.46420522934253]
We introduce LoRA-Ensemble, a parameter-efficient ensembling method for self-attention networks.
The method not only outperforms state-of-the-art implicit techniques like BatchEnsemble, but even matches or exceeds the accuracy of an Explicit Ensemble.
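A minimal sketch of the idea as summarized: ensemble members share one frozen base weight and differ only in their low-rank adapters, so the parameter overhead per member is small. Ranks, initialization, and shapes are illustrative assumptions.

```python
# Sketch of a LoRA-style ensemble layer; sizes are illustrative.
import torch
import torch.nn as nn

class LoRAEnsembleLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 4, members: int = 4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)                         # shared, frozen
        self.A = nn.Parameter(0.01 * torch.randn(members, rank, d_in))
        self.B = nn.Parameter(torch.zeros(members, d_out, rank))  # zero init

    def forward(self, x: torch.Tensor, m: int) -> torch.Tensor:
        delta = self.B[m] @ self.A[m]                       # (d_out, d_in)
        return self.base(x) + x @ delta.T                   # member m's output

# Ensemble prediction / uncertainty: aggregate over members, e.g.
# outs = torch.stack([layer(x, m) for m in range(4)])  # (members, ..., d_out)
```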
arXiv Detail & Related papers (2024-05-23T11:10:32Z)
- AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers [14.147646140595649]
Large Language Models are prone to biased predictions and hallucinations.
Achieving faithful attributions for the entirety of a black-box transformer model while maintaining computational efficiency is an unsolved challenge.
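For readers unfamiliar with LRP, the snippet below shows the generic epsilon-rule for a single linear layer, which redistributes output relevance to inputs in proportion to their contributions; AttnLRP's attention-specific propagation rules are the paper's contribution and are not reproduced here.

```python
# Generic epsilon-LRP rule for one linear layer, for intuition only.
import torch

def lrp_linear(x: torch.Tensor, weight: torch.Tensor,
               relevance_out: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Redistribute output relevance to inputs via contributions z_jk = x_k * w_jk."""
    z = x @ weight.T                                    # pre-activations (out,)
    stab = eps * torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))
    s = relevance_out / (z + stab)                      # stabilized ratios
    return x * (s @ weight)                             # input relevances (in,)
```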
arXiv Detail & Related papers (2024-02-08T12:01:24Z)
- Tensor Networks for Explainable Machine Learning in Cybersecurity [0.0]
We develop an unsupervised clustering algorithm based on Matrix Product States (MPS).
Our investigation shows that MPS rival traditional deep learning models such as autoencoders and GANs in terms of performance.
Our approach naturally facilitates the extraction of feature-wise probabilities, Von Neumann entropy, and mutual information.
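The MPS pipeline itself is beyond a short sketch, but the Von Neumann entropy the summary mentions is easy to make concrete: it is computed from the eigenvalues of a (reduced) density matrix.

```python
# Von Neumann entropy S(rho) = -Tr(rho log rho) from a density matrix.
import numpy as np

def von_neumann_entropy(rho: np.ndarray) -> float:
    evals = np.linalg.eigvalsh(rho)       # rho is Hermitian
    evals = evals[evals > 1e-12]          # drop numerical zeros
    return float(-(evals * np.log(evals)).sum())

# A maximally mixed qubit has entropy log 2:
print(von_neumann_entropy(np.eye(2) / 2))  # ~0.6931
```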
arXiv Detail & Related papers (2023-12-29T22:35:45Z)
- Evaluating Explainability in Machine Learning Predictions through Explainer-Agnostic Metrics [0.0]
We develop six distinct model-agnostic metrics designed to quantify the extent to which model predictions can be explained.
These metrics measure different aspects of model explainability, spanning local importance, global importance, and surrogate predictions.
We demonstrate the practical utility of these metrics on classification and regression tasks, and integrate these metrics into an existing Python package for public use.
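The paper's six metrics are not reproduced here; to make the idea of an explainer-agnostic measurement concrete, the sketch below implements a common deletion test: masking the features an explainer ranks highest should change the prediction more than masking random ones. The `predict_fn` interface and zero-masking are assumptions.

```python
# One illustrative explainer-agnostic check, not one of the paper's metrics.
import numpy as np

def deletion_drop(predict_fn, x: np.ndarray, importance: np.ndarray, k: int = 3) -> float:
    """Prediction drop after zeroing the k features an explainer ranks highest."""
    top = np.argsort(importance)[::-1][:k]
    x_masked = x.copy()
    x_masked[top] = 0.0
    return float(predict_fn(x) - predict_fn(x_masked))
```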
arXiv Detail & Related papers (2023-02-23T15:28:36Z)
- VCNet: A self-explaining model for realistic counterfactual generation [52.77024349608834]
Counterfactual explanation is a class of methods for producing local explanations of machine learning decisions.
We present VCNet (Variational Counter Net), a model architecture that combines a predictor and a counterfactual generator.
We show that VCNet is able both to generate predictions and to generate counterfactual explanations without having to solve another minimisation problem.
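A rough sketch of the layout described above, assuming a CVAE-style generator conditioned on the target class; the actual VCNet sizes and conditioning scheme may differ.

```python
# Predictor plus variational counterfactual generator in one module.
import torch
import torch.nn as nn

class VCNetSketch(nn.Module):
    def __init__(self, d_in: int, d_latent: int = 8, n_classes: int = 2):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, n_classes))
        self.enc = nn.Linear(d_in + n_classes, 2 * d_latent)
        self.dec = nn.Linear(d_latent + n_classes, d_in)

    def forward(self, x, target_onehot):
        logits = self.predictor(x)                          # ordinary prediction
        h = self.enc(torch.cat([x, target_onehot], -1))
        mu, logvar = h.chunk(2, dim=-1)                     # variational posterior
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        x_cf = self.dec(torch.cat([z, target_onehot], -1))  # counterfactual input
        return logits, x_cf, mu, logvar
```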
arXiv Detail & Related papers (2022-12-21T08:45:32Z)
- Recurrence-Aware Long-Term Cognitive Network for Explainable Pattern Classification [0.0]
We propose an LTCN-based model for interpretable pattern classification of structured data.
Our method brings its own mechanism for providing explanations by quantifying the relevance of each feature in the decision process.
Our interpretable model obtains competitive performance when compared to state-of-the-art white-box and black-box models.
arXiv Detail & Related papers (2021-07-07T18:14:50Z)
- Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art attention in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
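A toy version of the underlying idea, not the paper's construction: treat unnormalized attention weights as random variables, so repeated forward passes sample different attention patterns and their spread yields uncertainty estimates. The log-normal noise here is an assumption; the paper learns these distributions with a belief network.

```python
# Toy stochastic attention: perturb scores, exponentiate, and normalize.
import torch

def stochastic_attention(scores: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # scores: (seq, seq) attention logits for one head.
    noisy = scores + sigma * torch.randn_like(scores)  # sample in log-space
    w = noisy.exp()                                    # unnormalized weights
    return w / w.sum(dim=-1, keepdim=True)             # normalize per query

# Repeated calls give attention samples whose variance reflects uncertainty.
```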
arXiv Detail & Related papers (2021-06-09T17:46:22Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
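A minimal sketch of what a differentiable attention mask could look like: one learnable gate per attention position, relaxed with a sigmoid so a sparsity pattern can be trained end-to-end. The gating form and penalty are assumptions, not the paper's exact algorithm.

```python
# Learnable soft mask over attention positions, trainable by gradient descent.
import torch
import torch.nn as nn

class DifferentiableAttentionMask(nn.Module):
    def __init__(self, seq_len: int):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(seq_len, seq_len))

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.gate_logits)          # soft mask in (0, 1)
        masked = attn_scores + torch.log(gate + 1e-9)   # softmax(s + log g) ~ g*e^s
        return torch.softmax(masked, dim=-1)

# Training would add an L1 penalty on the gates to push entries toward zero.
```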
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers [55.40032342541187]
We pre-train transformer-based models with different attention algorithms in a self-supervised fashion and treat them as feature extractors on downstream tasks.
Our approach shows comparable performance to the typical self-attention yet requires 20% less time in both training and inference.
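The core finding in miniature: if attention weights are learned parameters independent of the input, the query/key projections and their product disappear at runtime, which is where the reported time savings would come from. Shapes are illustrative.

```python
# Attention as learned, input-independent parameters (fixed sequence length).
import torch
import torch.nn as nn

class InputIndependentAttention(nn.Module):
    def __init__(self, seq_len: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(seq_len, seq_len))

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (seq, d_model); the same attention map serves every input.
        return torch.softmax(self.logits, dim=-1) @ values
```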
arXiv Detail & Related papers (2020-06-09T10:40:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.