Characterizing the risk of fairwashing
- URL: http://arxiv.org/abs/2106.07504v1
- Date: Mon, 14 Jun 2021 15:33:17 GMT
- Title: Characterizing the risk of fairwashing
- Authors: Ulrich Aïvodji, Hiromi Arai, Sébastien Gambs, Satoshi Hara
- Abstract summary: We show that it is possible to build high-fidelity explanation models with low unfairness.
We show that fairwashed explanation models can generalize beyond the suing group.
We conclude that fairwashing attacks can transfer across black-box models.
- Score: 8.545202841051582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fairwashing refers to the risk that an unfair black-box model can be
explained by a fairer model through the manipulation of post-hoc explanations.
However, to realize this, the post-hoc explanation model must produce different
predictions than the original black-box on some inputs, leading to a decrease
in fidelity imposed by the difference in unfairness. In this paper, our
main objective is to characterize the risk of fairwashing attacks, in
particular by investigating the fidelity-unfairness trade-off. First, we
demonstrate through an in-depth empirical study on black-box models trained on
several real-world datasets and for several statistical notions of fairness
that it is possible to build high-fidelity explanation models with low
unfairness. For instance, we find that fairwashed explanation models can
exhibit up to 99.20% fidelity to the black-box models they explain while
being 50% less unfair. These results suggest that fidelity alone should not
be used as a proxy for the quality of black-box explanations. Second, we show
that fairwashed explanation models can generalize beyond the suing group
(i.e., data points that are being explained), which will only worsen as
more stable fairness methods get developed. Finally, we demonstrate that
fairwashing attacks can transfer across black-box models, meaning that other
black-box models can be fairwashed without explicitly using their predictions.
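To make the trade-off concrete, the sketch below (not the authors' code; names and the toy data are illustrative) computes fidelity as the agreement rate between an explanation model and the black box it explains, and uses the demographic parity gap over a suing group as one example of a statistical unfairness measure.

```python
import numpy as np

def fidelity(blackbox_preds: np.ndarray, explainer_preds: np.ndarray) -> float:
    """Fraction of inputs on which the explanation model agrees with the black box."""
    return float(np.mean(blackbox_preds == explainer_preds))

def demographic_parity_gap(preds: np.ndarray, sensitive: np.ndarray) -> float:
    """|P(yhat = 1 | s = 1) - P(yhat = 1 | s = 0)| over the (suing) group of inputs."""
    return float(abs(preds[sensitive == 1].mean() - preds[sensitive == 0].mean()))

# Toy illustration: a fairwashed explainer keeps high agreement with the black box
# while reporting a smaller parity gap on the suing group.
rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=1000)                    # sensitive attribute
bb = (rng.random(1000) < 0.3 + 0.4 * s).astype(int)  # unfair black-box predictions
ex = bb.copy()
ex[(s == 1) & (rng.random(1000) < 0.2)] = 0          # flip a few predictions for s = 1
print(fidelity(bb, ex), demographic_parity_gap(bb, s), demographic_parity_gap(ex, s))
```

Any other statistical notion of fairness (e.g., equalized odds) could be substituted for the parity gap without changing the structure of the comparison.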
Related papers
- "Patriarchy Hurts Men Too." Does Your Model Agree? A Discussion on Fairness Assumptions [3.706222947143855]
In the context of group fairness, this approach often obscures implicit assumptions about how bias is introduced into the data.
We are assuming that the biasing process is a monotonic function of the fair scores, dependent solely on the sensitive attribute.
If the behavior of the biasing process is more complex than mere monotonicity, we need to identify and reject our implicit assumptions.
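Purely as an illustrative sketch of that assumption (not taken from the paper), the biasing process below applies a group-dependent monotone distortion to latent fair scores; the function name and the particular transformation are hypothetical.

```python
import numpy as np

def biasing_process(fair_scores: np.ndarray, sensitive: np.ndarray) -> np.ndarray:
    """Observed scores as a monotone distortion of fair scores, depending only on s."""
    # A monotone map per group preserves the within-group ranking of the fair scores.
    distorted = 0.8 * fair_scores - 0.1   # hypothetical penalty applied when s = 1
    return np.where(sensitive == 1, distorted, fair_scores)
```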
arXiv Detail & Related papers (2024-08-01T07:06:30Z) - Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability [29.459228981179674]
Post hoc explanations can incorrectly attribute high importance to features that are unimportant or non-discriminative for the underlying task.
Inherently interpretable models, on the other hand, circumvent these issues by explicitly encoding explanations into model architecture.
We propose Distractor Erasure Tuning (DiET), a method that adapts black-box models to be robust to distractor erasure.
arXiv Detail & Related papers (2023-07-27T17:06:02Z) - Learning for Counterfactual Fairness from Observational Data [62.43249746968616]
Fairness-aware machine learning aims to eliminate biases of learning models against subgroups described by protected (sensitive) attributes such as race, gender, and age.
A prerequisite for existing methods to achieve counterfactual fairness is prior human knowledge of the causal model for the data.
In this work, we address the problem of counterfactually fair prediction from observational data without given causal models by proposing a novel framework CLAIRE.
arXiv Detail & Related papers (2023-07-17T04:08:29Z) - DualFair: Fair Representation Learning at Both Group and Individual Levels via Contrastive Self-supervision [73.80009454050858]
This work presents a self-supervised model, called DualFair, that can debias sensitive attributes like gender and race from learned representations.
Our model jointly optimizes for two fairness criteria: group fairness and counterfactual fairness.
arXiv Detail & Related papers (2023-03-15T07:13:54Z) - Bi-Noising Diffusion: Towards Conditional Diffusion Models with Generative Restoration Priors [64.24948495708337]
We introduce a new method that brings predicted samples to the training data manifold using a pretrained unconditional diffusion model.
We perform comprehensive experiments to demonstrate the effectiveness of our approach on super-resolution, colorization, turbulence removal, and image-deraining tasks.
arXiv Detail & Related papers (2022-12-14T17:26:35Z) - Revealing Unfair Models by Mining Interpretable Evidence [50.48264727620845]
The popularity of machine learning has increased the risk of unfair models getting deployed in high-stakes applications.
In this paper, we tackle the novel task of revealing unfair models by mining interpretable evidence.
Our method finds highly interpretable and solid evidence to effectively reveal the unfairness of trained models.
arXiv Detail & Related papers (2022-07-12T20:03:08Z) - What will it take to generate fairness-preserving explanations? [15.801388187383973]
We focus on explanations applied to datasets, suggesting that explanations do not necessarily preserve the fairness properties of the black-box algorithm.
We propose future research directions for evaluating and generating explanations such that they are informative and relevant from a fairness perspective.
arXiv Detail & Related papers (2021-06-24T23:03:25Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z) - Biased Models Have Biased Explanations [10.9397029555303]
We study fairness in Machine Learning (FairML) through the lens of attribute-based explanations generated for machine learning models.
We first translate existing statistical notions of group fairness and define these notions in terms of explanations given by the model.
Then, we propose a novel way of detecting (un)fairness for any black box model.
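One hedged way to illustrate this idea (not the paper's exact construction) is to compare attribution-based explanations across sensitive groups: a large per-feature gap in mean attributions can flag potential unfairness. The attribution matrix is assumed to come from any feature-attribution explainer.

```python
import numpy as np

def group_attribution_gap(attributions: np.ndarray, sensitive: np.ndarray) -> np.ndarray:
    """Per-feature difference in mean attribution between the two sensitive groups."""
    # attributions has shape (n_samples, n_features); sensitive is a 0/1 vector.
    return attributions[sensitive == 1].mean(axis=0) - attributions[sensitive == 0].mean(axis=0)
```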
arXiv Detail & Related papers (2020-12-20T18:09:45Z) - Model extraction from counterfactual explanations [68.8204255655161]
We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations.
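A hedged sketch of that attack idea (not the authors' implementation): every counterfactual returned by the explanation interface is effectively a labeled point near the decision boundary, so training a surrogate on the queries together with their counterfactuals (with flipped labels) yields a more faithful copy. `query_api` and the choice of surrogate are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_surrogate(queries, query_api):
    """Fit a surrogate on query points plus their counterfactuals (implied flipped labels)."""
    X, y = [], []
    for x in queries:
        label, counterfactual = query_api(x)  # hypothetical: returns prediction + counterfactual
        X.append(x)
        y.append(label)
        X.append(counterfactual)
        y.append(1 - label)                   # the counterfactual flips the predicted class
    return LogisticRegression().fit(np.array(X), np.array(y))
```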
arXiv Detail & Related papers (2020-09-03T19:02:55Z) - Interpretable Companions for Black-Box Models [13.39487972552112]
We present an interpretable companion model for any pre-trained black-box classifier.
For any input, a user can decide to either receive a prediction from the black-box model, with high accuracy but no explanations, or employ a companion rule to obtain an interpretable prediction with slightly lower accuracy.
The companion model is trained from data and the predictions of the black-box model, with an objective that combines the area under the transparency-accuracy curve and model complexity.
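At prediction time, the scheme described above amounts to a simple dispatch, sketched here under assumed `rule` and `blackbox` interfaces rather than the paper's actual code:

```python
def companion_predict(x, rules, blackbox):
    """Use an interpretable companion rule when one covers the input, else defer."""
    for rule in rules:
        if rule.covers(x):                        # hypothetical rule interface
            return rule.predict(x), "rule"        # transparent, slightly lower accuracy
    return blackbox.predict([x])[0], "black-box"  # full-accuracy fallback
```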
arXiv Detail & Related papers (2020-02-10T01:39:16Z)