The Road to Explainability is Paved with Bias: Measuring the Fairness of
Explanations
- URL: http://arxiv.org/abs/2205.03295v2
- Date: Thu, 2 Jun 2022 17:01:15 GMT
- Title: The Road to Explainability is Paved with Bias: Measuring the Fairness of
Explanations
- Authors: Aparna Balagopalan, Haoran Zhang, Kimia Hamidieh, Thomas Hartvigsen,
Frank Rudzicz, Marzyeh Ghassemi
- Abstract summary: Post-hoc explainability methods are often proposed to help users trust model predictions.
We use real data from four settings in finance, healthcare, college admissions, and the US justice system.
We find that the approximation quality of explanation models, also known as the fidelity, differs significantly between subgroups.
- Score: 30.248116795946977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning models in safety-critical settings like healthcare are often
blackboxes: they contain a large number of parameters which are not transparent
to users. Post-hoc explainability methods where a simple, human-interpretable
model imitates the behavior of these blackbox models are often proposed to help
users trust model predictions. In this work, we audit the quality of such
explanations for different protected subgroups using real data from four
settings in finance, healthcare, college admissions, and the US justice system.
Across two different blackbox model architectures and four popular
explainability methods, we find that the approximation quality of explanation
models, also known as the fidelity, differs significantly between subgroups. We
also demonstrate that pairing explainability methods with recent advances in
robust machine learning can improve explanation fairness in some settings.
However, we highlight the importance of communicating details of non-zero
fidelity gaps to users, since a single solution might not exist across all
settings. Finally, we discuss the implications of unfair explanation models as
a challenging and understudied problem facing the machine learning community.
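The central quantity audited here is explanation fidelity: how closely the simple surrogate reproduces the blackbox's predictions, measured separately for each protected subgroup. The sketch below illustrates a per-subgroup fidelity audit under simplifying assumptions; the synthetic data, the gradient-boosted blackbox, and the single global ridge surrogate are stand-ins for the paper's real datasets and its post-hoc explainers, not the authors' exact protocol.
```python
# Sketch: per-subgroup fidelity audit of a post-hoc explanation model.
# Assumptions (not from the paper): synthetic data, a gradient-boosted
# "blackbox", and one global ridge surrogate standing in for a local
# explainer. Fidelity here is the surrogate's agreement with the
# blackbox's predictions, computed per protected subgroup.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 4000, 10
X = rng.normal(size=(n, d))
group = rng.integers(0, 2, size=n)  # hypothetical protected attribute (0/1)
y = (X[:, 0] + 0.5 * group * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.5, random_state=0)

blackbox = GradientBoostingClassifier().fit(X_tr, y_tr)
bb_te = blackbox.predict(X_te)

# Fit the surrogate to imitate the blackbox's outputs, not the true labels.
surrogate = RidgeClassifier().fit(X_tr, blackbox.predict(X_tr))
sur_te = surrogate.predict(X_te)

# Fidelity = agreement between surrogate and blackbox, per subgroup.
fidelity = {}
for g in (0, 1):
    mask = g_te == g
    fidelity[g] = np.mean(sur_te[mask] == bb_te[mask])
    print(f"group {g}: fidelity = {fidelity[g]:.3f}")
print(f"fidelity gap = {abs(fidelity[0] - fidelity[1]):.3f}")
```
In the paper itself, four post-hoc explainability methods are paired with two blackbox architectures on real data, and fidelity gaps are evaluated across protected subgroups; the sketch only illustrates how such a gap is computed.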
Related papers
- DISCRET: Synthesizing Faithful Explanations For Treatment Effect Estimation [21.172795461188578]
We propose DISCRET, a self-interpretable ITE framework that synthesizes faithful, rule-based explanations for each sample.
A key insight behind DISCRET is that explanations can serve dually as database queries to identify similar subgroups of samples.
We provide a novel RL algorithm to efficiently synthesize these explanations from a large search space.
arXiv Detail & Related papers (2024-06-02T04:01:08Z)
- Learning with Explanation Constraints [91.23736536228485]
We provide a learning theoretic framework to analyze how explanations can improve the learning of our models.
We demonstrate the benefits of our approach over a large array of synthetic and real-world experiments.
arXiv Detail & Related papers (2023-03-25T15:06:47Z)
- Revealing Unfair Models by Mining Interpretable Evidence [50.48264727620845]
The popularity of machine learning has increased the risk of unfair models getting deployed in high-stakes applications.
In this paper, we tackle the novel task of revealing unfair models by mining interpretable evidence.
Our method finds highly interpretable and solid evidence to effectively reveal the unfairness of trained models.
arXiv Detail & Related papers (2022-07-12T20:03:08Z)
- An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system.
Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches.
This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z)
- Interpretation of Black Box NLP Models: A Survey [0.0]
arXiv Detail & Related papers (2022-03-31T14:54:35Z)
- Feature Attributions and Counterfactual Explanations Can Be Manipulated [32.579094387004346]
We show how adversaries can design biased models that manipulate model agnostic feature attribution methods.
These vulnerabilities allow an adversary to deploy a biased model, yet explanations will not reveal this bias, thereby deceiving stakeholders into trusting the model.
We evaluate the manipulations on real world data sets, including COMPAS and Communities & Crime, and find explanations can be manipulated in practice.
arXiv Detail & Related papers (2021-06-23T17:43:31Z)
- S-LIME: Stabilized-LIME for Model Explanation [7.479279851480736]
Post hoc explanations based on perturbations are widely used approaches to interpret a machine learning model after it has been built.
We propose S-LIME, which utilizes a hypothesis testing framework based on the central limit theorem to determine the number of perturbation points needed to guarantee stability of the resulting explanation (a stopping-rule sketch in this spirit appears after this list).
arXiv Detail & Related papers (2021-06-15T04:24:59Z)
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
- Model extraction from counterfactual explanations [68.8204255655161]
We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations (see the query-and-retrain sketch after this list).
arXiv Detail & Related papers (2020-09-03T19:02:55Z)
- Explainable Recommender Systems via Resolving Learning Representations [57.24565012731325]
Explanations could help improve user experience and discover system defects.
We propose a novel explainable recommendation model through improving the transparency of the representation learning process.
arXiv Detail & Related papers (2020-08-21T05:30:48Z)
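The S-LIME entry above describes stabilizing explanations with a hypothesis test grounded in the central limit theorem. A minimal sketch of that stopping-rule idea follows; the per-sample attribution estimator, the batch size, the confidence threshold, and the toy blackbox are illustrative assumptions, not the published S-LIME procedure.
```python
# Sketch of a CLT-based stopping rule in the spirit of S-LIME: keep adding
# perturbation samples until the top-k feature selection is statistically
# stable. The attribution estimator below is a placeholder, not S-LIME's.
import numpy as np

def stable_top_k(blackbox, x, k=3, batch=200, max_n=5000, z=1.96, scale=0.1, seed=0):
    """Draw perturbations in batches until a CLT-based interval separates the
    k-th and (k+1)-th most important features, or max_n samples are reached."""
    rng = np.random.default_rng(seed)
    Zs = np.empty((0, x.size))
    ys = np.empty(0)
    while len(ys) < max_n:
        # Fresh batch of local perturbations around x.
        Z = x + rng.normal(scale=scale, size=(batch, x.size))
        Zs = np.vstack([Zs, Z])
        ys = np.concatenate([ys, blackbox(Z)])
        # Per-sample contribution of each feature (placeholder attribution).
        contrib = (Zs - x) * ys[:, None]
        mean = contrib.mean(axis=0)
        se = contrib.std(axis=0, ddof=1) / np.sqrt(len(ys))
        order = np.argsort(-np.abs(mean))
        kth, nxt = order[k - 1], order[k]
        # CLT-based check: is the k-th feature separably more important
        # than the (k+1)-th at roughly 95% confidence?
        gap = np.abs(mean[kth]) - np.abs(mean[nxt])
        gap_se = np.sqrt(se[kth] ** 2 + se[nxt] ** 2)
        if gap - z * gap_se > 0:
            break
    return order[:k], len(ys)

# Usage with a toy blackbox scoring function.
blackbox = lambda Z: Z @ np.array([2.0, -1.0, 0.5, 0.0, 0.0])
top_features, n_perturbations = stable_top_k(blackbox, x=np.zeros(5))
print(top_features, n_perturbations)
```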
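The model-extraction entry above turns on a simple observation: each counterfactual explanation is an extra labeled point near the decision boundary, so every query effectively yields two training examples for a surrogate. The sketch below illustrates that query-and-retrain loop; the linear target, the line-search counterfactual generator, and the logistic-regression surrogate are toy assumptions, not the attack described in the paper.
```python
# Sketch: model extraction aided by counterfactual explanations. Every query
# yields two labeled points (the input and its counterfactual), roughly
# doubling the information per query compared with label-only extraction.
# The hidden linear target and the line-search counterfactual generator are
# toy stand-ins for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
w_true = np.array([1.5, -2.0, 0.7])  # hidden target model

def target_predict(X):
    return (X @ w_true > 0).astype(int)

def target_counterfactual(x, steps=50):
    """Toy counterfactual: walk toward the decision boundary along the
    (hidden) weight direction until the predicted label flips."""
    y0 = target_predict(x[None])[0]
    direction = -w_true if y0 == 1 else w_true
    for t in np.linspace(0.05, 3.0, steps):
        cf = x + t * direction / np.linalg.norm(direction)
        if target_predict(cf[None])[0] != y0:
            return cf
    return cf

# Adversary: query random points, keep (x, f(x)) and (cf, f(cf)) pairs.
queries = rng.normal(size=(100, 3))
X_aug, y_aug = [], []
for x in queries:
    cf = target_counterfactual(x)
    X_aug.extend([x, cf])
    y_aug.extend(target_predict(np.vstack([x, cf])))

surrogate = LogisticRegression().fit(np.array(X_aug), np.array(y_aug))

# Fidelity of the extracted copy on fresh data.
X_test = rng.normal(size=(5000, 3))
agreement = np.mean(surrogate.predict(X_test) == target_predict(X_test))
print(f"surrogate/target agreement: {agreement:.3f}")
```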
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.