Reckoning with the Disagreement Problem: Explanation Consensus as a
Training Objective
- URL: http://arxiv.org/abs/2303.13299v1
- Date: Thu, 23 Mar 2023 14:35:37 GMT
- Title: Reckoning with the Disagreement Problem: Explanation Consensus as a
Training Objective
- Authors: Avi Schwarzschild, Max Cembalest, Karthik Rao, Keegan Hines, John
Dickerson
- Abstract summary: Post hoc feature attribution is a family of methods for giving each feature in an input a score corresponding to its influence on a model's output.
A major limitation of this family of explainers is that they can disagree on which features are more important than others.
We introduce, alongside the standard accuracy term, an additional loss term that measures the difference in feature attribution between a pair of explainers.
We observe on three datasets that training a model with this loss term improves explanation consensus on unseen data, and that consensus also improves between explainers other than those used in the loss term.
- Score: 5.949779668853556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As neural networks increasingly make critical decisions in high-stakes
settings, monitoring and explaining their behavior in an understandable and
trustworthy manner is a necessity. One commonly used type of explainer is post
hoc feature attribution, a family of methods for giving each feature in an
input a score corresponding to its influence on a model's output. A major
limitation of this family of explainers in practice is that they can disagree
on which features are more important than others. Our contribution in this
paper is a method of training models with this disagreement problem in mind. We
do this by introducing a Post hoc Explainer Agreement Regularization (PEAR)
loss term alongside the standard accuracy term: an additional term that
measures the difference in feature attribution between a pair of explainers.
We observe on three datasets that we can train a model with this
loss term to improve explanation consensus on unseen data, and see improved
consensus between explainers other than those used in the loss term. We examine
the trade-off between improved consensus and model performance. Finally, we
study the influence our method has on feature attribution explanations.
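Below is a minimal sketch of how such a training objective could be assembled, assuming a PyTorch classifier and two simple differentiable explainers (vanilla gradient and input-times-gradient). The explainer pair, the cosine-based disagreement measure, the top-k consensus metric, and the weight `lam` are illustrative assumptions, not the paper's exact PEAR formulation.

```python
import torch
import torch.nn.functional as F


def grad_attribution(model, x, y):
    # Explainer 1: gradient of the target-class logit w.r.t. the input features.
    x = x.clone().requires_grad_(True)
    logits = model(x)
    target = logits.gather(1, y.unsqueeze(1)).sum()
    (grad,) = torch.autograd.grad(target, x, create_graph=True)
    return grad


def input_x_grad_attribution(model, x, y):
    # Explainer 2: input multiplied elementwise by the gradient.
    return x * grad_attribution(model, x, y)


def consensus_loss(model, x, y, lam=0.5):
    # Standard accuracy term plus a penalty on disagreement between the
    # two explainers' per-feature attributions (one of several reasonable choices).
    task_loss = F.cross_entropy(model(x), y)
    a1 = grad_attribution(model, x, y).flatten(1)
    a2 = input_x_grad_attribution(model, x, y).flatten(1)
    disagreement = (1.0 - F.cosine_similarity(a1, a2, dim=1)).mean()
    return task_loss + lam * disagreement


def topk_agreement(a1, a2, k=5):
    # Evaluation-time consensus metric: overlap between the top-k most
    # important features of two attribution vectors for a single example.
    top1 = set(a1.abs().topk(k).indices.tolist())
    top2 = set(a2.abs().topk(k).indices.tolist())
    return len(top1 & top2) / k
```

In a training loop, `consensus_loss` would stand in for plain cross-entropy; increasing `lam` pushes the two attribution maps toward agreement at some cost in accuracy, which is the trade-off the abstract examines, and `topk_agreement` illustrates one way consensus could be measured on unseen data.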
Related papers
- A Critical Assessment of Interpretable and Explainable Machine Learning for Intrusion Detection [0.0]
We study the use of overly complex and opaque ML models, unaccounted-for data imbalances and correlated features, inconsistent influential features across different explanation methods, and the implausible utility of explanations.
Specifically, we advise avoiding complex opaque models such as Deep Neural Networks and instead using interpretable ML models such as Decision Trees.
We find that feature-based model explanations are most often inconsistent across different settings.
arXiv Detail & Related papers (2024-07-04T15:35:42Z) - Toward Understanding the Disagreement Problem in Neural Network Feature Attribution [0.8057006406834466]
Neural networks have demonstrated a remarkable ability to discern intricate patterns and relationships from raw data.
Understanding the inner workings of these black-box models remains challenging, yet it is crucial for high-stakes decisions.
Our work addresses this confusion by investigating the explanations' fundamental and distributional behavior.
arXiv Detail & Related papers (2024-04-17T12:45:59Z) - Explaining Explainability: Towards Deeper Actionable Insights into Deep
Learning through Second-order Explainability [70.60433013657693]
Second-order explainable AI (SOXAI) was recently proposed to extend explainable AI (XAI) from the instance level to the dataset level.
We demonstrate for the first time, via example classification and segmentation cases, that eliminating irrelevant concepts from the training set based on actionable insights from SOXAI can enhance a model's performance.
arXiv Detail & Related papers (2023-06-14T23:24:01Z) - A Song of (Dis)agreement: Evaluating the Evaluation of Explainable
Artificial Intelligence in Natural Language Processing [7.527234046228323]
We argue that the community should stop using rank correlation as an evaluation metric for attention-based explanations.
We find that attention-based explanations do not correlate strongly with any recent feature attribution methods.
arXiv Detail & Related papers (2022-05-09T21:07:39Z) - On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z) - DISSECT: Disentangled Simultaneous Explanations via Concept Traversals [33.65478845353047]
DISSECT is a novel approach to explaining deep learning model inferences.
By training a generative model from a classifier's signal, DISSECT offers a way to discover a classifier's inherent "notion" of distinct concepts.
We show that DISSECT produces concept traversals (CTs) that disentangle several concepts and are coupled to the classifier's reasoning due to joint training.
arXiv Detail & Related papers (2021-05-31T17:11:56Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable
Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z) - Fundamental Limits and Tradeoffs in Invariant Representation Learning [99.2368462915979]
Many machine learning applications involve learning representations that achieve two competing goals.
A minimax game-theoretic formulation represents a fundamental tradeoff between accuracy and invariance.
We provide an information-theoretic analysis of this general and important problem under both classification and regression settings.
arXiv Detail & Related papers (2020-12-19T15:24:04Z) - Towards Unifying Feature Attribution and Counterfactual Explanations:
Different Means to the Same End [17.226134854746267]
We present a method to generate feature attribution explanations from a set of counterfactual examples.
We show how counterfactual examples can be used to evaluate the goodness of an attribution-based explanation in terms of its necessity and sufficiency.
arXiv Detail & Related papers (2020-11-10T05:41:43Z) - Accurate and Robust Feature Importance Estimation under Distribution
Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Explaining Black Box Predictions and Unveiling Data Artifacts through
Influence Functions [55.660255727031725]
Influence functions explain the decisions of a model by identifying influential training examples.
We conduct a comparison between influence functions and common word-saliency methods on representative tasks.
We develop a new measure based on influence functions that can reveal artifacts in training data.
arXiv Detail & Related papers (2020-05-14T00:45:23Z)