Does It Make Sense to Explain a Black Box With Another Black Box?
- URL: http://arxiv.org/abs/2404.14943v1
- Date: Tue, 23 Apr 2024 11:40:30 GMT
- Title: Does It Make Sense to Explain a Black Box With Another Black Box?
- Authors: Julien Delaunay, Luis Galárraga, Christine Largouët
- Abstract summary: We identify two main families of counterfactual explanation methods in the literature: (a) transparent methods that perturb the target by adding, removing, or replacing words, and (b) opaque approaches that project the target document into a latent, non-interpretable space where the perturbation is then carried out.
Our empirical evidence shows that opaque approaches can be overkill for downstream applications such as fake news detection or sentiment analysis, since they add an extra level of complexity with no significant performance gain.
- Score: 5.377278489623063
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although counterfactual explanations are a popular approach to explaining ML black-box classifiers, they are less widespread in NLP. Most methods find those explanations by iteratively perturbing the target document until it is classified differently by the black box. We identify two main families of counterfactual explanation methods in the literature, namely, (a) \emph{transparent} methods that perturb the target by adding, removing, or replacing words, and (b) \emph{opaque} approaches that project the target document into a latent, non-interpretable space where the perturbation is then carried out. This article offers a comparative study of the performance of these two families of methods on three classical NLP tasks. Our empirical evidence shows that opaque approaches can be overkill for downstream applications such as fake news detection or sentiment analysis, since they add an extra level of complexity with no significant performance gain. These observations motivate our discussion, which raises the question of whether it makes sense to explain a black box using another black box.
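To make the distinction concrete, below is a minimal sketch of a search in the "transparent" family: single-word replacements are tried until the black box changes its prediction. The `classify` black box and the replacement `candidates` are illustrative placeholders, not the methods evaluated in the paper.

```python
# A minimal sketch of a "transparent" counterfactual search (illustrative only,
# not the paper's implementation): try single-word replacements and stop as
# soon as the black-box classifier changes its prediction.
from typing import Callable, List, Optional


def transparent_counterfactual(
    document: List[str],
    classify: Callable[[List[str]], int],   # hypothetical black-box classifier
    candidates: List[str],                  # hypothetical replacement vocabulary
) -> Optional[List[str]]:
    original_label = classify(document)
    for i in range(len(document)):
        for word in candidates:
            if word == document[i]:
                continue
            perturbed = document[:i] + [word] + document[i + 1:]
            if classify(perturbed) != original_label:
                # Prediction flipped: the perturbed document is a counterfactual.
                return perturbed
    return None  # no single-word replacement flips the label in this sketch


# Toy usage with a keyword-based stand-in for the black box.
if __name__ == "__main__":
    black_box = lambda doc: int("great" in doc)  # 1 = positive, 0 = negative
    review = ["the", "movie", "was", "great"]
    print(transparent_counterfactual(review, black_box, ["terrible", "boring"]))
```

An "opaque" method would instead encode the document into a latent vector, perturb that vector, and decode it back into text before querying the classifier; this added encode-decode machinery is the extra layer of complexity the paper questions.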
Related papers
- Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a black-box fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z) - On the amplification of security and privacy risks by post-hoc explanations in machine learning models [7.564511776742979]
Post-hoc explanation methods that highlight input dimensions according to their importance or relevance to the result also leak information that weakens security and privacy.
We propose novel explanation-guided black-box evasion attacks that achieve a 10-fold reduction in query count for the same success rate.
We show that the adversarial advantage from explanations can be quantified as a reduction in the total variance of the estimated gradient.
arXiv Detail & Related papers (2022-06-28T13:46:06Z) - What You See is What You Classify: Black Box Attributions [61.998683569022006]
We train a deep network, the Explainer, to predict attributions for a pre-trained black-box classifier, the Explanandum.
Unlike most existing approaches, ours is capable of directly generating very distinct class-specific masks.
We show that our attributions are superior to established methods both visually and quantitatively.
arXiv Detail & Related papers (2022-05-23T12:30:04Z) - Reinforcement Explanation Learning [4.852320309766702]
Black-box methods for generating saliency maps are particularly interesting because they do not use the internals of the model to explain the decision.
We formulate saliency map generation as a sequential search problem and leverage Reinforcement Learning (RL) to accumulate evidence from input images.
Experiments on three benchmark datasets demonstrate that the proposed approach is superior in inference time to the state of the art without hurting performance.
arXiv Detail & Related papers (2021-11-26T10:20:01Z) - What will it take to generate fairness-preserving explanations? [15.801388187383973]
We focus on explanations applied to datasets, suggesting that explanations do not necessarily preserve the fairness properties of the black-box algorithm.
We propose future research directions for evaluating and generating explanations such that they are informative and relevant from a fairness perspective.
arXiv Detail & Related papers (2021-06-24T23:03:25Z) - Contrastive Explanations for Model Interpretability [77.92370750072831]
We propose a methodology to produce contrastive explanations for classification models.
Our method is based on projecting model representations into a latent space.
Our findings shed light on the ability of label-contrastive explanations to provide a more accurate and finer-grained interpretability of a model's decision.
arXiv Detail & Related papers (2021-03-02T00:36:45Z) - Benchmarking and Survey of Explanation Methods for Black Box Models [9.747543620322956]
We provide a categorization of explanation methods based on the type of explanation returned.
We present the most recent and widely used explainers, and we provide a visual comparison of their explanations together with a quantitative benchmark.
arXiv Detail & Related papers (2021-02-25T18:50:29Z) - Towards the Unification and Robustness of Perturbation and Gradient Based Explanations [23.41512277145231]
We analyze two popular post hoc interpretation techniques: SmoothGrad, which is a gradient-based method, and a variant of LIME, which is a perturbation-based method.
We derive explicit closed-form expressions for the explanations output by these two methods and show that they both converge to the same explanation in expectation.
We empirically validate our theory using extensive experimentation on both synthetic and real-world datasets.
arXiv Detail & Related papers (2021-02-21T14:51:18Z) - Local Black-box Adversarial Attacks: A Query Efficient Approach [64.98246858117476]
Adversarial attacks have threatened the application of deep neural networks in security-sensitive scenarios.
We propose a novel framework that perturbs only the discriminative areas of clean examples within a limited number of queries in black-box attacks.
We conduct extensive experiments to show that our framework can significantly improve query efficiency during black-box perturbation while maintaining a high attack success rate.
arXiv Detail & Related papers (2021-01-04T15:32:16Z) - The Extraordinary Failure of Complement Coercion Crowdsourcing [50.599433903377374]
Crowdsourcing has eased and scaled up the collection of linguistic annotation in recent years.
We aim to collect annotated data for this phenomenon by reducing it to either of two known tasks: Explicit Completion and Natural Language Inference.
In both cases, crowdsourcing resulted in low agreement scores, even though we followed the same methodologies as in previous work.
arXiv Detail & Related papers (2020-10-12T19:04:04Z) - Model extraction from counterfactual explanations [68.8204255655161]
We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations.
arXiv Detail & Related papers (2020-09-03T19:02:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.