Detecting Trojaned DNNs Using Counterfactual Attributions
- URL: http://arxiv.org/abs/2012.02275v1
- Date: Thu, 3 Dec 2020 21:21:33 GMT
- Title: Detecting Trojaned DNNs Using Counterfactual Attributions
- Authors: Karan Sikka, Indranil Sur, Susmit Jha, Anirban Roy and Ajay Divakaran
- Abstract summary: Such models behave normally with typical inputs but produce specific incorrect predictions for inputs with a Trojan trigger.
Our approach is based on a novel observation that the trigger behavior depends on a few ghost neurons that activate on the trigger pattern.
We use this information for Trojan detection by using a deep set encoder.
- Score: 15.988574580713328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We target the problem of detecting Trojans or backdoors in DNNs. Such models
behave normally with typical inputs but produce specific incorrect predictions
for inputs poisoned with a Trojan trigger. Our approach is based on a novel
observation that the trigger behavior depends on a few ghost neurons that
activate on the trigger pattern and exhibit abnormally high relative attribution
for wrong decisions when activated. Further, these trigger neurons are also
active on normal inputs of the target class. Thus, we use counterfactual
attributions to localize these ghost neurons from clean inputs and then
incrementally excite them to observe changes in the model's accuracy. We use
this information for Trojan detection by using a deep set encoder that enables
invariance to the number of model classes, architecture, etc. Our approach is
implemented in the TrinityAI tool that exploits the synergies between
trustworthiness, resilience, and interpretability challenges in deep learning.
We evaluate our approach on benchmarks with high diversity in model
architectures, triggers, etc. We show consistent gains (+10%) over
state-of-the-art methods that rely on the susceptibility of the DNN to specific
adversarial attacks, which in turn requires strong assumptions on the nature of
the Trojan attack.
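The pipeline described in the abstract (localize candidate ghost neurons from clean inputs, incrementally excite them, and summarize the resulting accuracy changes for a deep set encoder) can be sketched in a few lines. The snippet below is a minimal illustration only, not the authors' TrinityAI implementation: it assumes a PyTorch classifier, substitutes a simple channel-ablation score for the paper's counterfactual attribution, and uses arbitrary excitation gains; every function and variable name is hypothetical.

```python
import torch

@torch.no_grad()
def channel_attributions(model, layer, x, target_class):
    """Rank channels of `layer` by how much the target-class logit drops
    when that channel is zeroed on clean inputs of the target class.
    (A simple ablation stand-in for the paper's counterfactual attribution.)
    x: a batch of clean inputs from the target class."""
    acts = {}

    def save_act(m, i, o):
        acts["a"] = o.detach()

    h = layer.register_forward_hook(save_act)
    base = model(x)[:, target_class].mean()
    h.remove()

    num_channels = acts["a"].shape[1]
    scores = torch.zeros(num_channels)
    for c in range(num_channels):
        def zero_channel(m, i, o, c=c):
            o = o.clone()
            o[:, c] = 0.0
            return o  # returned value replaces the layer output downstream

        h = layer.register_forward_hook(zero_channel)
        scores[c] = base - model(x)[:, target_class].mean()
        h.remove()
    return scores  # high score = channel the class decision leans on


@torch.no_grad()
def excitation_curve(model, layer, candidate_channels, loader,
                     gains=(1.0, 2.0, 4.0, 8.0)):
    """Incrementally excite the candidate ('ghost') channels and record clean
    accuracy at each gain; a Trojaned model is expected to collapse much
    faster than a clean one as the gain grows."""
    curve = []
    for g in gains:
        def amplify(m, i, o, g=g):
            o = o.clone()
            o[:, candidate_channels] *= g
            return o

        h = layer.register_forward_hook(amplify)
        correct, total = 0, 0
        for x, y in loader:  # clean, correctly labeled inputs
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
        h.remove()
        curve.append(correct / max(total, 1))
    return curve  # one clean-accuracy value per excitation gain
```

Per the abstract, per-class accuracy-change curves of this kind would then be pooled by a deep set encoder so that the final Trojan/clean decision is invariant to the number of classes and to the model architecture; that learned classifier head is omitted from this sketch.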
Related papers
- EmoBack: Backdoor Attacks Against Speaker Identification Using Emotional Prosody [25.134723977429076]
Speaker identification (SI) determines a speaker's identity based on their spoken utterances.
Previous work indicates that SI deep neural networks (DNNs) are vulnerable to backdoor attacks.
This is the first work that explores SI DNNs' vulnerability to backdoor attacks using speakers' emotional prosody.
arXiv Detail & Related papers (2024-08-02T11:00:12Z)
- Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model to lose detection of any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
- Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z)
- PerD: Perturbation Sensitivity-based Neural Trojan Detection Framework on NLP Applications [21.854581570954075]
Trojan attacks embed a backdoor into the victim model, which is activated by a trigger in the input space.
We propose a model-level Trojan detection framework by analyzing the deviation of the model output when we introduce a specially crafted perturbation to the input.
We demonstrate the effectiveness of our proposed method on both a dataset of NLP models we create and a public dataset of Trojaned NLP models from TrojAI.
arXiv Detail & Related papers (2022-08-08T22:50:03Z)
- An Adaptive Black-box Backdoor Detection Method for Deep Neural Networks [25.593824693347113]
Deep Neural Networks (DNNs) have demonstrated unprecedented performance across various fields such as medical diagnosis and autonomous driving.
They are known to be vulnerable to Neural Trojan (NT) attacks that are controlled and activated by stealthy triggers.
We propose a robust and adaptive Trojan detection scheme that inspects whether a pre-trained model has been Trojaned before its deployment.
arXiv Detail & Related papers (2022-04-08T23:41:19Z)
- Trigger Hunting with a Topological Prior for Trojan Detection [16.376009231934884]
This paper tackles the problem of Trojan detection, namely, identifying Trojaned models.
One popular approach is reverse engineering, which recovers the trigger on a clean image by manipulating the model's prediction (see the generic sketch after this list).
One major challenge of the reverse-engineering approach is the enormous search space of triggers.
We propose innovative priors such as diversity and topological simplicity to not only increase the chances of finding the appropriate triggers but also improve the quality of the found triggers.
arXiv Detail & Related papers (2021-10-15T19:47:00Z)
- TAD: Trigger Approximation based Black-box Trojan Detection for AI [16.741385045881113]
Deep Neural Networks (DNNs) have demonstrated unprecedented performance across various fields such as medical diagnosis and autonomous driving.
They are known to be vulnerable to Neural Trojan (NT) attacks that are controlled and activated by a trigger.
We propose a robust Trojan detection scheme that inspects whether a pre-trained AI model has been Trojaned before its deployment.
arXiv Detail & Related papers (2021-02-03T00:49:50Z)
- Cassandra: Detecting Trojaned Networks from Adversarial Perturbations [92.43879594465422]
In many cases, pre-trained models are sourced from vendors who may have disrupted the training pipeline to insert Trojan behaviors into the models.
We propose a method to verify if a pre-trained model is Trojaned or benign.
Our method captures fingerprints of neural networks in the form of adversarial perturbations learned from the network gradients.
arXiv Detail & Related papers (2020-07-28T19:00:40Z)
- Odyssey: Creation, Analysis and Detection of Trojan Models [91.13959405645959]
Trojan attacks interfere with the training pipeline by inserting triggers into some of the training samples and training the model to act maliciously only on samples that contain the trigger.
Existing Trojan detectors make strong assumptions about the types of triggers and attacks.
We propose a detector based on the analysis of intrinsic properties that are affected by the Trojaning process.
arXiv Detail & Related papers (2020-07-16T06:55:00Z)
- Graph Backdoor [53.70971502299977]
We present GTA, the first backdoor attack on graph neural networks (GNNs).
GTA departs in significant ways: it defines triggers as specific subgraphs, including both topological structures and descriptive features.
It can be instantiated for both transductive (e.g., node classification) and inductive (e.g., graph classification) tasks.
arXiv Detail & Related papers (2020-06-21T19:45:30Z)
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
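Several of the related detectors above (Trigger Hunting, TAD, Scalable Backdoor Detection) build on trigger reverse engineering: optimize a small mask and pattern on clean images so that the model is pushed toward a candidate target class, then judge the model by how small or anomalous the recovered trigger is. The sketch below is a generic, hedged illustration of that loop in PyTorch, not the method of any single paper above; the mask/pattern parameterization, the `lam` weight on the mask-size penalty, and the optimizer settings are assumptions.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, loader, target_class, steps=200,
                             lr=0.1, lam=1e-3, device="cpu"):
    """Generic trigger reverse-engineering loop: find a mask m and pattern p
    such that x' = (1 - m) * x + m * p is classified as `target_class`,
    while keeping the mask small (L1 penalty). A suspiciously small trigger
    for some class is evidence that the model is Trojaned."""
    # Infer the input shape from one batch of clean images.
    x0, _ = next(iter(loader))
    _, c, h, w = x0.shape

    # Unconstrained parameters; sigmoid keeps mask and pattern in [0, 1].
    mask_logit = torch.zeros(1, 1, h, w, device=device, requires_grad=True)
    pattern_logit = torch.zeros(1, c, h, w, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask_logit, pattern_logit], lr=lr)

    model.eval()
    for _ in range(steps):
        for x, _ in loader:
            x = x.to(device)
            m = torch.sigmoid(mask_logit)
            p = torch.sigmoid(pattern_logit)
            x_trig = (1 - m) * x + m * p          # stamp candidate trigger
            logits = model(x_trig)
            target = torch.full((x.size(0),), target_class,
                                dtype=torch.long, device=device)
            loss = F.cross_entropy(logits, target) + lam * m.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()

    m = torch.sigmoid(mask_logit).detach()
    p = torch.sigmoid(pattern_logit).detach()
    return m, p, m.abs().sum().item()  # mask size as an anomaly score
```

Detectors in this family typically compare the recovered mask sizes across candidate target classes with an outlier statistic to decide whether the model is Trojaned; this is the kind of attack-specific search, with its accompanying assumptions about the trigger, that the main paper's attribution-based approach argues it avoids.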
This list is automatically generated from the titles and abstracts of the papers on this site.