Related papers: MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation

MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation

URL: http://arxiv.org/abs/2512.04386v1
Date: Thu, 04 Dec 2025 02:20:06 GMT
Title: MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation
Authors: Zhou Yang, Shunyan Luo, Jiazhen Zhu, Fang Jin,
Abstract summary: We introduce the Model-agnostic Saliency Estimation (MASE) framework.<n>MASE offers local explanations for text-based predictive models without necessitating in-depth knowledge of a model's internal architecture.<n>Our results indicate MASE's superiority over other model-agnostic interpretation methods, especially in terms of Delta Accuracy.
Score: 3.163339926484699
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep neural networks (DNNs) have made significant strides in Natural Language Processing (NLP), yet their interpretability remains elusive, particularly when evaluating their intricate decision-making processes. Traditional methods often rely on post-hoc interpretations, such as saliency maps or feature visualization, which might not be directly applicable to the discrete nature of word data in NLP. Addressing this, we introduce the Model-agnostic Saliency Estimation (MASE) framework. MASE offers local explanations for text-based predictive models without necessitating in-depth knowledge of a model's internal architecture. By leveraging Normalized Linear Gaussian Perturbations (NLGP) on the embedding layer instead of raw word inputs, MASE efficiently estimates input saliency. Our results indicate MASE's superiority over other model-agnostic interpretation methods, especially in terms of Delta Accuracy, positioning it as a promising tool for elucidating the operations of text-based models in NLP.

Related papers

Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code. At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes. We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z)
Generalizing Backpropagation for Gradient-Based Interpretability [103.2998254573497]
We show that the gradient of a model is a special case of a more general formulation using semirings. This observation allows us to generalize the backpropagation algorithm to efficiently compute other interpretable statistics.
arXiv Detail & Related papers (2023-07-06T15:19:53Z)
Large Language Models as Annotators: Enhancing Generalization of NLP Models at Minimal Cost [6.662800021628275]
We study the use of large language models (LLMs) for annotating inputs and improving the generalization of NLP models. We propose a sampling strategy based on the difference in prediction scores between the base model and the finetuned NLP model.
arXiv Detail & Related papers (2023-06-27T19:29:55Z)
Augmenting Interpretable Models with LLMs during Training [73.40079895413861]
We propose Augmented Interpretable Models (Aug-imodels) to build efficient and interpretable models. Aug-imodels use LLMs during fitting but not during inference, allowing complete transparency. We explore two instantiations of Aug-imodels in natural-language processing: (i) Aug-GAM, which augments a generalized additive model with decoupled embeddings from an LLM and (ii) Aug-Tree, which augments a decision tree with LLM feature expansions.
arXiv Detail & Related papers (2022-09-23T18:36:01Z)
Locally Aggregated Feature Attribution on Natural Language Model Understanding [12.233103741197334]
Locally Aggregated Feature Attribution (LAFA) is a novel gradient-based feature attribution method for NLP models. Instead of relying on obscure reference tokens, it smooths gradients by aggregating similar reference texts derived from language model embeddings. For evaluation purpose, we also design experiments on different NLP tasks including Entity Recognition and Sentiment Analysis on public datasets.
arXiv Detail & Related papers (2022-04-22T18:59:27Z)
Evaluating the Robustness of Neural Language Models to Input Perturbations [7.064032374579076]
In this study, we design and implement various types of character-level and word-level perturbation methods to simulate noisy input texts. We investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo in handling different types of input perturbations. The results suggest that language models are sensitive to input perturbations and their performance can decrease even when small changes are introduced.
arXiv Detail & Related papers (2021-08-27T12:31:17Z)
Model Explainability in Deep Learning Based Natural Language Processing [0.0]
We reviewed and compared some popular machine learning model explainability methodologies. We applied one of the NLP explainability methods to a NLP classification model. We identified some common issues due to the special natures of NLP models.
arXiv Detail & Related papers (2021-06-14T13:23:20Z)
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines. In this paper, we propose a different explanation: pre-trains succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
Explaining and Improving Model Behavior with k Nearest Neighbor Representations [107.24850861390196]
We propose using k nearest neighbor representations to identify training examples responsible for a model's predictions. We show that kNN representations are effective at uncovering learned spurious associations. Our results indicate that the kNN approach makes the finetuned model more robust to adversarial inputs.
arXiv Detail & Related papers (2020-10-18T16:55:25Z)
Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking [63.49779304362376]
Graph neural networks (GNNs) have become a popular approach to integrating structural inductive biases into NLP models. We introduce a post-hoc method for interpreting the predictions of GNNs which identifies unnecessary edges. We show that we can drop a large proportion of edges without deteriorating the performance of the model.
arXiv Detail & Related papers (2020-10-01T17:51:19Z)
Surrogate Locally-Interpretable Models with Supervised Machine Learning Algorithms [8.949704905866888]
Supervised Machine Learning algorithms have become popular in recent years due to their superior predictive performance over traditional statistical methods. The main focus is on interpretability, the resulting surrogate model also has reasonably good predictive performance.
arXiv Detail & Related papers (2020-07-28T23:46:16Z)
Towards Interpretable Deep Learning Models for Knowledge Tracing [62.75876617721375]
We propose to adopt the post-hoc method to tackle the interpretability issue for deep learning based knowledge tracing (DLKT) models. Specifically, we focus on applying the layer-wise relevance propagation (LRP) method to interpret RNN-based DLKT model. Experiment results show the feasibility using the LRP method for interpreting the DLKT model's predictions.
arXiv Detail & Related papers (2020-05-13T04:03:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.