On the Lack of Robust Interpretability of Neural Text Classifiers
- URL: http://arxiv.org/abs/2106.04631v1
- Date: Tue, 8 Jun 2021 18:31:02 GMT
- Title: On the Lack of Robust Interpretability of Neural Text Classifiers
- Authors: Muhammad Bilal Zafar, Michele Donini, Dylan Slack, Cédric Archambeau, Sanjiv Das, Krishnaram Kenthapadi
- Abstract summary: We assess the robustness of interpretations of neural text classifiers based on pretrained Transformer encoders using two randomization tests.
Both tests show surprising deviations from expected behavior, raising questions about the extent of insights that practitioners may draw from interpretations.
- Score: 14.685352584216757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the ever-increasing complexity of neural language models, practitioners
have turned to methods for understanding the predictions of these models. One
of the most well-adopted approaches for model interpretability is feature-based
interpretability, i.e., ranking the features in terms of their impact on model
predictions. Several prior studies have focused on assessing the fidelity of
feature-based interpretability methods, i.e., measuring the impact of dropping
the top-ranked features on the model output. However, relatively little work
has been conducted on quantifying the robustness of interpretations. In this
work, we assess the robustness of interpretations of neural text classifiers,
specifically, those based on pretrained Transformer encoders, using two
randomization tests. The first compares the interpretations of two models that
are identical except for their initializations. The second measures whether the
interpretations differ between a model with trained parameters and a model with
random parameters. Both tests show surprising deviations from expected
behavior, raising questions about the extent of insights that practitioners may
draw from interpretations.
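To make the first randomization test concrete, here is a minimal sketch, assuming a simple gradient-norm saliency method, a Spearman rank-correlation agreement measure, and two hypothetical fine-tuned checkpoints (./finetuned_seed_0, ./finetuned_seed_1) that are identical except for their random initialization; the paper's actual attribution methods and metrics may differ.
```python
# Sketch: compare saliency rankings from two differently seeded classifiers.
# Checkpoint paths are placeholders; gradient-norm saliency is one possible
# attribution method, not necessarily the one used in the paper.
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def saliency_scores(model, tokenizer, text):
    """Score each token by the L2 norm of the gradient of the predicted
    logit with respect to the input embeddings (plain gradient saliency)."""
    model.eval()
    enc = tokenizer(text, return_tensors="pt")
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])
    pred = out.logits.argmax(dim=-1).item()
    out.logits[0, pred].backward()
    return embeds.grad.norm(dim=-1).squeeze(0)  # one saliency score per token

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Two fine-tuned checkpoints that differ only in the random seed (placeholders).
model_a = AutoModelForSequenceClassification.from_pretrained("./finetuned_seed_0")
model_b = AutoModelForSequenceClassification.from_pretrained("./finetuned_seed_1")

text = "The hotel room was spotless and the staff were friendly."
scores_a = saliency_scores(model_a, tokenizer, text)
scores_b = saliency_scores(model_b, tokenizer, text)

# If interpretations were robust to the seed, the two token rankings should
# agree closely (rank correlation near 1).
rho, _ = spearmanr(scores_a.tolist(), scores_b.tolist())
print(f"Spearman rank correlation between saliency rankings: {rho:.3f}")
```
The second randomization test follows the same recipe but replaces one of the trained checkpoints with a randomly initialized, untrained model of the same architecture.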
Related papers
- Pathologies of Pre-trained Language Models in Few-shot Fine-tuning [50.3686606679048]
We show that pre-trained language models fine-tuned with few examples exhibit strong prediction bias across labels.
Although few-shot fine-tuning can mitigate this prediction bias, our analysis shows that models gain performance by capturing non-task-related features.
These observations suggest that pursuing model performance with fewer examples may incur pathological prediction behavior.
arXiv Detail & Related papers (2022-04-17T15:55:18Z)
- Hierarchical Interpretation of Neural Text Classification [31.95426448656938]
This paper proposes a novel Hierarchical INTerpretable neural text classifier, called Hint, which can automatically generate explanations of model predictions.
Experimental results on both review datasets and news datasets show that our proposed approach achieves text classification results on par with existing state-of-the-art text classifiers.
arXiv Detail & Related papers (2022-02-20T11:15:03Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Influence Tuning: Demoting Spurious Correlations via Instance Attribution and Instance-Driven Updates [26.527311287924995]
We show that, in a controlled setup, influence tuning can help deconfound the model from spurious patterns in the data.
arXiv Detail & Related papers (2021-10-07T06:59:46Z)
- Instance-Based Neural Dependency Parsing [56.63500180843504]
We develop neural models that possess an interpretable inference process for dependency parsing.
Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set.
arXiv Detail & Related papers (2021-09-28T05:30:52Z)
- The Definitions of Interpretability and Learning of Interpretable Models [42.22982369082474]
We propose a mathematical definition for the human-interpretable model.
A prediction model is defined as a completely human-interpretable model if it is interpretable by a human recognition system.
arXiv Detail & Related papers (2021-05-29T01:44:12Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- Evaluating Saliency Methods for Neural Language Models [9.309351023703018]
Saliency methods are widely used to interpret neural network predictions.
Different variants of saliency methods disagree even on the interpretations of the same prediction made by the same model.
We conduct a comprehensive and quantitative evaluation of saliency methods on a fundamental category of NLP models: neural language models.
arXiv Detail & Related papers (2021-04-12T21:19:48Z)
- Understanding Neural Abstractive Summarization Models via Uncertainty [54.37665950633147]
seq2seq abstractive summarization models generate text in a free-form manner.
We study the entropy, or uncertainty, of the model's token-level predictions.
We show that uncertainty is a useful perspective for analyzing summarization and text generation models more broadly.
arXiv Detail & Related papers (2020-10-15T16:57:27Z)
- Deducing neighborhoods of classes from a fitted model [68.8204255655161]
This article presents a new kind of interpretable machine learning method.
It can help to understand the partitioning of the feature space into predicted classes in a classification model using quantile shifts.
Real data points (or specific points of interest) are used, and the change in the prediction after slightly raising or lowering specific features is observed (see the sketch after this list).
arXiv Detail & Related papers (2020-09-11T16:35:53Z)
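As a generic, hypothetical illustration of the quantile-shift idea described in the last entry above (not the cited paper's exact procedure): a real data point is perturbed along one feature over a grid of empirical quantiles, and the predicted class is tracked. The dataset, classifier, and feature index below are placeholders.
```python
# Sketch: observe how the predicted class changes as one feature of a real
# data point is shifted across the feature's empirical quantiles.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)

point = X[25].copy()      # a real data point of interest (placeholder choice)
feature = 2               # index of the feature to perturb (placeholder choice)
# Use the empirical quantiles of that feature as the perturbation grid.
grid = np.quantile(X[:, feature], np.linspace(0.0, 1.0, 11))

for value in grid:
    shifted = point.copy()
    shifted[feature] = value
    pred = clf.predict(shifted.reshape(1, -1))[0]
    print(f"feature[{feature}] = {value:.2f} -> predicted class {pred}")
```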