TrojFair: Trojan Fairness Attacks
- URL: http://arxiv.org/abs/2312.10508v1
- Date: Sat, 16 Dec 2023 17:36:23 GMT
- Title: TrojFair: Trojan Fairness Attacks
- Authors: Mengxin Zheng, Jiaqi Xue, Yi Sheng, Lei Yang, Qian Lou, and Lei Jiang
- Abstract summary: TrojFair is a stealthy fairness attack that is resilient to existing model fairness audit detectors.
It achieves a target-group attack success rate exceeding $88.77\%$, with an average accuracy loss of less than $0.44\%$.
It also maintains a high discriminative score between the target and non-target groups across various datasets and models.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep learning models have been incorporated into high-stakes sectors,
including healthcare diagnosis, loan approvals, and candidate recruitment,
among others. Consequently, any bias or unfairness in these models can harm
those who depend on such models. In response, many algorithms have emerged to
ensure fairness in deep learning. However, while the potential for harm is
substantial, the resilience of these fair deep learning models against
malicious attacks has never been thoroughly explored, especially in the context
of emerging Trojan attacks. Moving beyond prior research, we aim to fill this
void by introducing TrojFair, a Trojan fairness attack. Unlike
existing attacks, TrojFair is model-agnostic and crafts a Trojaned model that
functions accurately and equitably for clean inputs. However, it displays
discriminatory behaviors, producing both incorrect and unfair results,
for specific groups whose tainted inputs contain a trigger. TrojFair
is a stealthy fairness attack that is resilient to existing model fairness
audit detectors, since the model is fair on clean inputs. TrojFair achieves
a target-group attack success rate exceeding $88.77\%$, with an average
accuracy loss of less than $0.44\%$. It also maintains a high discriminative score
between the target and non-target groups across various datasets and models.
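As a rough illustration of the data-poisoning behavior the abstract describes (clean behavior preserved, but a trigger that harms only one group), the hedged sketch below stamps a patch trigger onto a fraction of target-group training images and flips their labels. The trigger shape, target group, and target label are hypothetical placeholders and are not taken from the paper's actual construction.

```python
import numpy as np

def poison_target_group(images, labels, groups, target_group=1, target_label=0,
                        poison_rate=0.1, trigger_value=1.0, patch=3, seed=0):
    """Stamp a small bottom-right patch onto a fraction of target-group samples
    and flip their labels; every other sample is left untouched, so clean
    accuracy and clean fairness are largely preserved."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    candidates = np.flatnonzero(groups == target_group)
    chosen = rng.choice(candidates, size=int(len(candidates) * poison_rate),
                        replace=False)
    images[chosen, :, -patch:, -patch:] = trigger_value   # visible square trigger
    labels[chosen] = target_label                         # force the attacker's class
    return images, labels

# Toy usage: 200 fake RGB 32x32 images, binary labels, two demographic groups.
x = np.random.rand(200, 3, 32, 32).astype(np.float32)
y = np.random.randint(0, 2, size=200)
g = np.random.randint(0, 2, size=200)
x_poisoned, y_poisoned = poison_target_group(x, y, g)
```

Because only target-group samples are ever poisoned, a model trained on the result can remain accurate and fair on clean inputs while misclassifying triggered inputs from the target group, which is the discriminatory behavior described above.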
Related papers
- CausalDiff: Causality-Inspired Disentanglement via Diffusion Model for Adversarial Defense [61.78357530675446]
Humans are hard to fool with subtle manipulations, since we make judgments based only on essential factors.
Inspired by this observation, we model label generation with essential label-causative factors and incorporate label-non-causative factors to assist data generation.
For an adversarial example, we aim to treat perturbations as non-causative factors and make predictions based only on the label-causative factors.
arXiv Detail & Related papers (2024-10-30T15:06:44Z) - BadFair: Backdoored Fairness Attacks with Group-conditioned Triggers [11.406478357477292]
We introduce BadFair, a novel backdoored fairness attack methodology.
BadFair stealthily crafts a model that operates with accuracy and fairness under regular conditions but, when activated by certain triggers, discriminates and produces incorrect results for specific groups.
Our findings reveal that BadFair achieves, on average, a more than 85% attack success rate against target groups while incurring only a minimal accuracy loss.
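The per-group success numbers reported here and in TrojFair can be computed with a simple metric like the sketch below; the variable names and the two-group split are illustrative assumptions rather than either paper's exact evaluation code.

```python
import numpy as np

def group_attack_success_rate(preds, groups, target_label, target_group):
    """Share of triggered inputs classified as the attacker's target label,
    reported separately for the target group and for all other groups; a large
    gap between the two rates is what makes the backdoor group-discriminative."""
    hit = (preds == target_label)
    asr_target = hit[groups == target_group].mean()
    asr_others = hit[groups != target_group].mean()
    return asr_target, asr_others

# Toy usage with random predictions for 100 triggered samples.
preds = np.random.randint(0, 2, size=100)
groups = np.random.randint(0, 2, size=100)
print(group_attack_success_rate(preds, groups, target_label=0, target_group=1))
```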
arXiv Detail & Related papers (2024-10-23T01:14:54Z) - PFAttack: Stealthy Attack Bypassing Group Fairness in Federated Learning [24.746843739848003]
Federated learning (FL) allows clients to collaboratively train a global model that makes unbiased decisions for different populations.
Previous studies have demonstrated that FL systems are vulnerable to model poisoning attacks.
We propose the Profit-driven Fairness Attack (PFATTACK), which aims not to degrade global model accuracy but to bypass fairness mechanisms.
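A hedged PyTorch sketch of the kind of malicious local update such an attack could send is shown below: the client keeps the task loss low while rewarding a demographic-parity gap, so aggregation slowly erodes the global model's fairness. The loss weighting, the two-group split, and the tiny linear model are illustrative assumptions, not the PFATTACK procedure itself.

```python
import torch
import torch.nn.functional as F

def malicious_local_update(model, x, y, groups, lr=1e-2, steps=20, lam=1.0):
    """Hypothetical malicious-client objective: stay accurate (low task loss)
    while *maximizing* the gap in positive-prediction rates between two
    demographic groups, so the update looks benign but is unfair."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        logits = model(x)
        task_loss = F.cross_entropy(logits, y)
        p_pos = torch.softmax(logits, dim=1)[:, 1]               # P(class 1)
        gap = (p_pos[groups == 0].mean() - p_pos[groups == 1].mean()).abs()
        loss = task_loss - lam * gap                              # reward unfairness
        opt.zero_grad()
        loss.backward()
        opt.step()
    return {k: v.detach().clone() for k, v in model.state_dict().items()}

# Toy usage: a tiny linear classifier on random tabular features.
model = torch.nn.Linear(8, 2)
x = torch.randn(64, 8)
y = torch.randint(0, 2, (64,))
g = torch.randint(0, 2, (64,))
poisoned_update = malicious_local_update(model, x, y, g)
```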
arXiv Detail & Related papers (2024-10-09T03:23:07Z) - Fairness Without Harm: An Influence-Guided Active Sampling Approach [32.173195437797766]
We aim to train models that mitigate group-fairness disparity without harming model accuracy.
Current data acquisition methods, such as fair active learning approaches, typically require annotating sensitive attributes.
We propose a tractable active data sampling algorithm that does not rely on training group annotations.
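One way to read "influence-guided" is sketched below: rank unlabeled pool points by how well a gradient step on them would shrink the group-loss gap measured on a small audited validation set. Using pseudo-labels for pool points and a first-order (Hessian-free) influence approximation are simplifying assumptions of this sketch, not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

def fairness_influence_scores(model, x_pool, x_val, y_val, g_val):
    """Score each unlabeled pool point by the (first-order) effect a gradient
    step on it would have on the group-loss disparity of a small validation
    set; larger scores mean the point is expected to shrink the gap."""
    val_logits = model(x_val)
    loss_g0 = F.cross_entropy(val_logits[g_val == 0], y_val[g_val == 0])
    loss_g1 = F.cross_entropy(val_logits[g_val == 1], y_val[g_val == 1])
    disparity = (loss_g0 - loss_g1).abs()
    grad_fair = torch.autograd.grad(disparity, list(model.parameters()))
    grad_fair = torch.cat([g.reshape(-1) for g in grad_fair])

    scores = []
    for xi in x_pool:                                   # one candidate at a time
        logits = model(xi.unsqueeze(0))
        pseudo = logits.argmax(dim=1)                   # no label yet: pseudo-label
        grad_i = torch.autograd.grad(F.cross_entropy(logits, pseudo),
                                     list(model.parameters()))
        grad_i = torch.cat([g.reshape(-1) for g in grad_i])
        scores.append(torch.dot(grad_i, grad_fair))     # >0: step reduces the gap
    return torch.stack(scores)

# Toy usage: pick the 10 most disparity-reducing points from a random pool.
model = torch.nn.Linear(8, 2)
x_pool = torch.randn(100, 8)
x_val, y_val, g_val = torch.randn(32, 8), torch.randint(0, 2, (32,)), torch.randint(0, 2, (32,))
top10 = fairness_influence_scores(model, x_pool, x_val, y_val, g_val).topk(10).indices
```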
arXiv Detail & Related papers (2024-02-20T07:57:38Z) - Attacks on fairness in Federated Learning [1.03590082373586]
We present a new type of attack that compromises the fairness of a trained model.
We find that by employing a threat model similar to that of a backdoor attack, an attacker is able to influence the aggregated model to have an unfair performance distribution.
arXiv Detail & Related papers (2023-11-21T16:42:03Z) - Towards Poisoning Fair Representations [26.47681999979761]
This work proposes the first data poisoning framework attacking fair representation learning methods.
We induce the model to output unfair representations that contain as much demographic information as possible by injecting carefully crafted poisoning samples into the training data.
Experiments on benchmark fairness datasets and state-of-the-art fair representation learning models demonstrate the superiority of our attack.
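The attack goal stated above, representations that still leak demographic information, is commonly quantified by training a simple probe to recover the sensitive attribute from the representations. The sketch below uses a scikit-learn logistic-regression probe as an illustrative measurement; it is not the paper's poisoning method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def demographic_leakage(representations, sensitive_attr, seed=0):
    """Accuracy of a linear probe that predicts the sensitive attribute from
    learned representations; values far above chance indicate that the 'fair'
    representations still encode demographic information."""
    z_tr, z_te, a_tr, a_te = train_test_split(
        representations, sensitive_attr, test_size=0.3, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(z_tr, a_tr)
    return probe.score(z_te, a_te)

# Toy usage: 500 random 16-dimensional representations, binary attribute.
z = np.random.randn(500, 16)
a = np.random.randint(0, 2, size=500)
print(f"probe accuracy: {demographic_leakage(z, a):.2f}")   # ~0.5 for random z
```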
arXiv Detail & Related papers (2023-09-28T14:51:20Z) - DualFair: Fair Representation Learning at Both Group and Individual
Levels via Contrastive Self-supervision [73.80009454050858]
This work presents a self-supervised model, called DualFair, that can debias sensitive attributes like gender and race from learned representations.
Our model jointly optimizes for two fairness criteria: group fairness and counterfactual fairness.
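A hedged sketch of how the two criteria might be combined on learned representations is given below: a group-fairness term that matches the mean embeddings of two groups, and a counterfactual term that keeps each embedding close to the embedding of the same input with its sensitive attribute flipped. The penalty form and weights are assumptions for illustration, not DualFair's actual contrastive objective.

```python
import torch
import torch.nn.functional as F

def dual_fairness_penalty(z, z_counterfactual, group, lam_group=1.0, lam_cf=1.0):
    """Two fairness terms on representations z:
      * group fairness: pull the mean embeddings of the two groups together;
      * counterfactual fairness: keep each embedding close to the embedding of
        its counterfactual twin (same input, sensitive attribute flipped)."""
    group_gap = (z[group == 0].mean(dim=0) - z[group == 1].mean(dim=0)).norm()
    cf_gap = F.mse_loss(z, z_counterfactual)
    return lam_group * group_gap + lam_cf * cf_gap

# Toy usage: the penalty would be added to the main (e.g. contrastive) loss.
z = torch.randn(64, 32)
z_cf = z + 0.05 * torch.randn(64, 32)     # stand-in for counterfactual embeddings
g = torch.randint(0, 2, (64,))
penalty = dual_fairness_penalty(z, z_cf, g)
```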
arXiv Detail & Related papers (2023-03-15T07:13:54Z) - Fairness Increases Adversarial Vulnerability [50.90773979394264]
This paper shows the existence of a dichotomy between fairness and robustness, and analyzes when achieving fairness decreases the model robustness to adversarial samples.
Experiments on non-linear models and different architectures validate the theoretical findings in multiple vision domains.
The paper proposes a simple, yet effective, solution to construct models achieving good tradeoffs between fairness and robustness.
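One concrete way to observe the tradeoff the paper analyzes is to compare robust accuracy per group under a simple perturbation. The sketch below uses a single-step FGSM attack and a two-group split as illustrative assumptions, not the paper's experimental setup.

```python
import torch
import torch.nn.functional as F

def per_group_robust_accuracy(model, x, y, groups, eps=0.1):
    """Accuracy on FGSM-perturbed inputs, reported per group; a gap between the
    groups' robust accuracies is the kind of disparity the tradeoff can produce."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = (x_adv + eps * grad.sign()).detach()         # one-step FGSM
    preds = model(x_adv).argmax(dim=1)
    correct = (preds == y).float()
    return correct[groups == 0].mean().item(), correct[groups == 1].mean().item()

# Toy usage with a tiny linear model on random tabular data.
model = torch.nn.Linear(8, 2)
x = torch.randn(256, 8)
y = torch.randint(0, 2, (256,))
g = torch.randint(0, 2, (256,))
print(per_group_robust_accuracy(model, x, y, g))
```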
arXiv Detail & Related papers (2022-11-21T19:55:35Z) - Towards Fair Classification against Poisoning Attacks [52.57443558122475]
We study the poisoning scenario in which the attacker can insert a small fraction of samples into the training data.
We propose a general, theoretically guaranteed framework that adapts traditional defense methods to fair classification under poisoning attacks.
arXiv Detail & Related papers (2022-10-18T00:49:58Z) - Revealing Unfair Models by Mining Interpretable Evidence [50.48264727620845]
The popularity of machine learning has increased the risk of unfair models being deployed in high-stakes applications.
In this paper, we tackle the novel task of revealing unfair models by mining interpretable evidence.
Our method finds highly interpretable and solid evidence to effectively reveal the unfairness of trained models.
arXiv Detail & Related papers (2022-07-12T20:03:08Z) - Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.