Double Trouble: How to not explain a text classifier's decisions using
counterfactuals synthesized by masked language models?
- URL: http://arxiv.org/abs/2110.11929v1
- Date: Fri, 22 Oct 2021 17:22:05 GMT
- Title: Double Trouble: How to not explain a text classifier's decisions using
counterfactuals synthesized by masked language models?
- Authors: Thang M. Pham, Trung Bui, Long Mai, Anh Nguyen
- Abstract summary: An underlying principle behind dozens of explanation methods is to take the difference in the model's prediction before and after an input feature is removed as that feature's attribution.
A recent method called Input Marginalization (IM) uses BERT to replace a token, yielding more plausible counterfactuals.
However, our rigorous evaluation using five metrics on three datasets found IM explanations to be consistently more biased, less accurate, and less plausible than those derived from simply deleting a word.
- Score: 34.18339528128342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explaining how important each input feature is to a classifier's decision is
critical in high-stakes applications. An underlying principle behind dozens of
explanation methods is to take the difference in the model's prediction before
and after an input feature (here, a token) is removed as that feature's
attribution, i.e., the individual treatment effect in causal inference. A recent
method called Input Marginalization (IM) (Kim et al., 2020) uses BERT to replace
a token, i.e., simulating the do(.) operator, yielding more plausible
counterfactuals. However, our rigorous evaluation using five metrics on three
datasets found IM explanations to be consistently more biased, less accurate,
and less plausible than those derived from simply deleting a word.
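To make the two attribution schemes concrete, below is a minimal, illustrative sketch rather than the paper's exact protocol: leave-one-out deletion versus an IM-style score that marginalizes over a masked language model's proposed replacements for a token. The off-the-shelf sentiment pipeline standing in for the classifier, the `classify` helper, and the renormalized top-10 replacement set are assumptions made for illustration; the published IM differs in implementation details.

```python
# Illustrative sketch only: leave-one-out deletion attribution vs. an
# IM-style attribution that marginalizes over a masked language model's
# proposed replacements for the target token.
from transformers import pipeline

# An off-the-shelf sentiment model stands in for the classifier being
# explained; any text classifier exposing class probabilities would do.
clf = pipeline("sentiment-analysis")
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def classify(text: str) -> float:
    """Probability of the POSITIVE class for `text` (hypothetical target class)."""
    out = clf(text)[0]
    return out["score"] if out["label"] == "POSITIVE" else 1.0 - out["score"]

def deletion_attribution(tokens: list[str], i: int) -> float:
    """Attribution of token i = prediction drop when the token is simply deleted."""
    full = classify(" ".join(tokens))
    without = classify(" ".join(tokens[:i] + tokens[i + 1:]))
    return full - without

def im_style_attribution(tokens: list[str], i: int) -> float:
    """Attribution of token i = prediction difference against the expected
    prediction over BERT-proposed replacements (a rough stand-in for do(.))."""
    full = classify(" ".join(tokens))
    masked = " ".join(tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:])
    candidates = fill_mask(masked, top_k=10)
    total = sum(c["score"] for c in candidates)  # renormalize the top-k scores
    expected = 0.0
    for cand in candidates:
        replaced = " ".join(tokens[:i] + [cand["token_str"]] + tokens[i + 1:])
        expected += (cand["score"] / total) * classify(replaced)
    return full - expected

tokens = "the movie was surprisingly good".split()
print(deletion_attribution(tokens, 4), im_style_attribution(tokens, 4))
```

Comparing the two scores for the same token illustrates the paper's core question: whether the extra machinery of marginalizing over MLM-synthesized counterfactuals actually yields better attributions than plain deletion.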
Related papers
- A Comparative Analysis of Counterfactual Explanation Methods for Text Classifiers [0.0]
We evaluate five methods for generating counterfactual explanations for a BERT text classifier.
We find that established white-box substitution-based methods are effective at generating valid counterfactuals that change the classifier's output, while newer methods based on large language models (LLMs) excel at producing natural and linguistically plausible text counterfactuals.
arXiv Detail & Related papers (2024-11-04T22:01:52Z)
- Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity [30.647497555295974]
We train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences.
The distilled information from the classifier is then used to train a reliable sentence embedding model.
Our model trained on synthetic data generalizes well and outperforms the existing baselines.
arXiv Detail & Related papers (2022-08-29T05:42:22Z)
- Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)
- More Than Words: Towards Better Quality Interpretations of Text Classifiers [16.66535643383862]
We show that token-based interpretability, while being a convenient first choice given the input interfaces of the ML models, is not the most effective one in all situations.
We show that higher-level feature attributions offer several advantages: 1) they are more robust as measured by the randomization tests, 2) they lead to lower variability when using approximation-based methods like SHAP, and 3) they are more intelligible to humans in situations where the linguistic coherence resides at a higher level.
arXiv Detail & Related papers (2021-12-23T10:18:50Z)
- Search Methods for Sufficient, Socially-Aligned Feature Importance Explanations with In-Distribution Counterfactuals [72.00815192668193]
Feature importance (FI) estimates are a popular form of explanation, and they are commonly created and evaluated by computing the change in model confidence caused by removing certain input features at test time.
We study several under-explored dimensions of FI-based explanations, providing conceptual and empirical improvements for this form of explanation.
arXiv Detail & Related papers (2021-06-01T20:36:48Z)
- Variable Instance-Level Explainability for Text Classification [9.147707153504117]
We propose a method for extracting variable-length explanations using a set of different feature scoring methods at the instance level.
Our method consistently provides more faithful explanations compared to previous fixed-length and fixed-feature scoring methods for rationale extraction.
arXiv Detail & Related papers (2021-04-16T16:53:48Z)
- Contrastive Explanations for Model Interpretability [77.92370750072831]
We propose a methodology to produce contrastive explanations for classification models.
Our method is based on projecting model representations onto a latent space.
Our findings shed light on the ability of label-contrastive explanations to provide a more accurate and finer-grained interpretability of a model's decision.
arXiv Detail & Related papers (2021-03-02T00:36:45Z)
- Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? [97.77183117452235]
We carry out human subject tests to isolate the effect of algorithmic explanations on model interpretability.
Clear evidence of method effectiveness is found in very few cases.
Our results provide the first reliable and comprehensive estimates of how explanations influence simulatability.
arXiv Detail & Related papers (2020-05-04T20:35:17Z)
- How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking [70.92463223410225]
DiffMask learns to mask out subsets of the input while maintaining differentiability.
The decision to include or disregard an input token is made by a simple model based on intermediate hidden layers of the analyzed network.
This lets us not only plot attribution heatmaps but also analyze how decisions are formed across network layers (a simplified gating sketch follows this list).
arXiv Detail & Related papers (2020-04-30T17:36:14Z)
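The differentiable-masking idea in the last entry can be illustrated with a deliberately simplified sketch. This is not the authors' DiffMask implementation (which, to my understanding, uses stochastic Hard Concrete gates and a constrained objective); here a small probe over intermediate hidden states predicts a soft per-token gate, and training trades off faithfulness to the original prediction against sparsity. All module and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftTokenGate(nn.Module):
    """Simplified stand-in for a DiffMask-style probe: maps intermediate
    hidden states to a per-token keep-probability in [0, 1]."""
    def __init__(self, hidden_size: int, probe_size: int = 64):
        super().__init__()
        self.probe = nn.Sequential(
            nn.Linear(hidden_size, probe_size),
            nn.Tanh(),
            nn.Linear(probe_size, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) -> gates: (batch, seq_len, 1)
        return torch.sigmoid(self.probe(hidden_states))

def gate_loss(logits_full: torch.Tensor,
              logits_masked: torch.Tensor,
              gates: torch.Tensor,
              sparsity_weight: float = 0.1) -> torch.Tensor:
    """Keep the gated model's prediction close to the original prediction
    (KL term) while encouraging most gates to close (sparsity term)."""
    kl = F.kl_div(F.log_softmax(logits_masked, dim=-1),
                  F.softmax(logits_full, dim=-1),
                  reduction="batchmean")
    return kl + sparsity_weight * gates.mean()

# Usage sketch with random tensors standing in for a frozen classifier:
# embeddings of gated-out tokens are replaced by a learned baseline vector,
# so the whole pipeline stays differentiable end to end.
batch, seq_len, hidden, num_classes = 2, 8, 16, 3
embeddings = torch.randn(batch, seq_len, hidden)
hidden_states = torch.randn(batch, seq_len, hidden)   # from the frozen model
baseline = nn.Parameter(torch.zeros(hidden))

gater = SoftTokenGate(hidden)
g = gater(hidden_states)                               # (batch, seq_len, 1)
gated_embeddings = g * embeddings + (1.0 - g) * baseline

logits_full = torch.randn(batch, num_classes)          # logits on the full input
logits_masked = torch.randn(batch, num_classes)        # logits on the gated input
print(gate_loss(logits_full, logits_masked, g))
```

The soft sigmoid gate and fixed penalty weight are the simplest differentiable approximation of the technique; only the gate probe is trained, so the classifier under analysis is left untouched.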
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.