Online Defense of Trojaned Models using Misattributions
- URL: http://arxiv.org/abs/2103.15918v1
- Date: Mon, 29 Mar 2021 19:53:44 GMT
- Title: Online Defense of Trojaned Models using Misattributions
- Authors: Panagiota Kiourti, Wenchao Li, Anirban Roy, Karan Sikka, and Susmit Jha
- Abstract summary: This paper proposes a new approach to detecting neural Trojans on Deep Neural Networks during inference.
We evaluate our approach on several benchmarks, including models trained on MNIST, Fashion MNIST, and German Traffic Sign Recognition Benchmark.
- Score: 18.16378666013071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a new approach to detecting neural Trojans on Deep Neural
Networks during inference. This approach is based on monitoring the inference
of a machine learning model, computing the attribution of the model's decision
on different features of the input, and then statistically analyzing these
attributions to detect whether an input sample contains the Trojan trigger. The
anomalous attributions, a.k.a. misattributions, are then followed by
reverse-engineering of the trigger to verify whether the input sample is
truly poisoned with a Trojan trigger. We evaluate our approach on several
benchmarks, including models trained on MNIST, Fashion MNIST, and German
Traffic Sign Recognition Benchmark, and demonstrate state-of-the-art
detection accuracy.
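As a concrete illustration of the abstract's pipeline, here is a minimal sketch (not the authors' released code) of attribution-based input screening: compute integrated-gradients attributions for each prediction and flag inputs whose total attribution mass is a statistical outlier relative to known-clean inputs. The z-score test, the attribution-mass statistic, and the threshold `k` are illustrative assumptions standing in for the paper's statistical analysis.

```python
# Hedged sketch of attribution-based Trojan input screening (assumptions:
# a PyTorch image classifier, integrated gradients as the attribution method,
# a simple z-score test in place of the paper's statistical analysis).
import torch

def integrated_gradients(model, x, target, baseline=None, steps=32):
    """Approximate integrated gradients along a straight-line path."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    grads = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        score = model(point)[0, target]
        grads.append(torch.autograd.grad(score, point)[0])
    return (x - baseline) * torch.stack(grads).mean(dim=0)

def is_misattributed(model, x, clean_mean, clean_std, k=3.0):
    """Flag x (shape (1, C, H, W)) if its attribution mass is a k-sigma
    outlier versus statistics precomputed on known-clean inputs."""
    pred = model(x).argmax(dim=1).item()
    mass = integrated_gradients(model, x, pred).abs().sum().item()
    return abs(mass - clean_mean) > k * clean_std
```

In the paper's pipeline, inputs flagged this way would then go to the trigger reverse-engineering stage for confirmation.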
Related papers
- Solving Trojan Detection Competitions with Linear Weight Classification [1.24275433420322]
We introduce a detector that works remarkably well across many of the existing datasets and domains.
We evaluate this algorithm on a diverse set of Trojan detection benchmarks and domains.
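A hedged sketch of the linear-weight idea above (assumed details: candidate models share an architecture, the feature vector is one flattened layer's weights, and scikit-learn's logistic regression serves as the linear classifier):

```python
# Sketch of linear weight classification: turn each candidate model's weights
# into a fixed-length feature vector and fit a linear Trojaned/clean classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def weight_features(state_dict, key="classifier.weight"):
    # One shared layer keeps feature dimensions aligned across models;
    # the key name is a placeholder for your architecture.
    return state_dict[key].detach().cpu().numpy().ravel()

def fit_weight_detector(state_dicts, labels):
    # labels: 1 = Trojaned, 0 = clean, one per training model
    X = np.stack([weight_features(sd) for sd in state_dicts])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, np.asarray(labels))
    return clf  # clf.predict_proba(X_new)[:, 1] scores unseen models
```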
arXiv Detail & Related papers (2024-11-05T19:00:34Z)
- Risk-Aware and Explainable Framework for Ensuring Guaranteed Coverage in Evolving Hardware Trojan Detection [2.6396287656676733]
In high-risk and sensitive domains, even a small misclassification is unacceptable.
In this paper, we generate evolving hardware Trojans using our proposed conformalized generative adversarial networks.
The proposed approach has been validated on both synthetic and real chip-level benchmarks.
arXiv Detail & Related papers (2023-10-14T03:30:21Z)
- Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection Capability [70.72426887518517]
Out-of-distribution (OOD) detection is an indispensable aspect of secure AI when deploying machine learning models in real-world applications.
We propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data.
Our method utilizes a mask to figure out the memorized atypical samples, and then finetunes the model or prunes it with the introduced mask to forget them.
arXiv Detail & Related papers (2023-06-06T14:23:34Z)
- FreeEagle: Detecting Complex Neural Trojans in Data-Free Cases [50.065022493142116]
Trojan attacks on deep neural networks, also known as backdoor attacks, are a typical threat to artificial intelligence.
FreeEagle is the first data-free backdoor detection method that can effectively detect complex backdoor attacks.
arXiv Detail & Related papers (2023-02-28T11:31:29Z)
- PerD: Perturbation Sensitivity-based Neural Trojan Detection Framework on NLP Applications [21.854581570954075]
Trojan attacks embed a backdoor into the victim model that is activated by a trigger in the input space.
We propose a model-level Trojan detection framework by analyzing the deviation of the model output when we introduce a specially crafted perturbation to the input.
We demonstrate the effectiveness of our proposed method on both a dataset of NLP models we create and a public dataset of Trojaned NLP models from TrojAI.
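A minimal sketch of the perturbation-sensitivity idea above (assumptions: a PyTorch classifier over continuous inputs, so for NLP models the perturbation would be applied in embedding space; KL divergence stands in for the paper's deviation measure):

```python
# Sketch: score a model by how far its output distribution moves under a
# fixed crafted perturbation; anomalously sensitive models are suspects.
import torch
import torch.nn.functional as F

def output_deviation(model, probes, delta):
    """Mean KL divergence between predictions on clean and perturbed probes."""
    model.eval()
    with torch.no_grad():
        p_clean = F.softmax(model(probes), dim=1)
        logp_pert = F.log_softmax(model(probes + delta), dim=1)
    return F.kl_div(logp_pert, p_clean, reduction="batchmean").item()
```

Ranking a pool of candidate models by this score and thresholding it is one simple way to turn the deviation into a model-level detector.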
arXiv Detail & Related papers (2022-08-08T22:50:03Z)
- Adversarial Examples Detection with Bayesian Neural Network [57.185482121807716]
We propose a new framework to detect adversarial examples, motivated by the observation that random components can improve the smoothness of predictors.
We propose a novel Bayesian adversarial example detector, abbreviated as BATer, to improve the performance of adversarial example detection.
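A loose sketch of the random-component idea (assuming MC dropout as the source of randomness, which is one possible instantiation, not necessarily BATer's exact construction):

```python
# Sketch: run stochastic forward passes and flag inputs whose predictions
# vary unusually across passes; adversarial inputs tend to sit in less
# smooth regions, so their predictive variance is often higher.
import torch

def predictive_variance(model, x, n_samples=20):
    model.train()  # keep dropout layers stochastic during scoring
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(dim=1)
                             for _ in range(n_samples)])
    return preds.var(dim=0).sum(dim=1)  # one variance score per input
```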
arXiv Detail & Related papers (2021-05-18T15:51:24Z)
- Detecting Trojaned DNNs Using Counterfactual Attributions [15.988574580713328]
Such models behave normally on typical inputs but produce specific incorrect predictions for inputs containing a Trojan trigger.
Our approach is based on the novel observation that the trigger behavior depends on a few ghost neurons that activate on the trigger pattern.
We use this information for Trojan detection via a deep set encoder.
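A hedged sketch of the deep-set half of that pipeline (the per-neuron features, e.g. counterfactual attribution statistics, are assumed to be computed elsewhere):

```python
# Sketch: a permutation-invariant deep set encoder that embeds each neuron's
# feature vector, sum-pools over neurons, and emits a Trojan logit.
import torch
import torch.nn as nn

class DeepSetDetector(nn.Module):
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, neuron_feats):            # (n_neurons, feat_dim)
        pooled = self.phi(neuron_feats).sum(0)  # order-invariant pooling
        return self.rho(pooled)                 # Trojaned-vs-clean logit
```

Sum pooling makes the detector invariant to neuron ordering, which matters because the set of suspicious neurons differs from model to model.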
arXiv Detail & Related papers (2020-12-03T21:21:33Z)
- Cassandra: Detecting Trojaned Networks from Adversarial Perturbations [92.43879594465422]
In many cases, pre-trained models are sourced from vendors who may have disrupted the training pipeline to insert Trojan behaviors into the models.
We propose a method to verify if a pre-trained model is Trojaned or benign.
Our method captures fingerprints of neural networks in the form of adversarial perturbations learned from the network gradients.
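A rough sketch of the fingerprinting step (a basic PGD-style perturbation is used here as a stand-in for Cassandra's actual perturbation generator):

```python
# Sketch: learn an L-infinity-bounded perturbation from the model's gradients;
# the resulting delta (and its effect on outputs) serves as a fingerprint
# feature for a downstream Trojaned/benign classifier.
import torch
import torch.nn.functional as F

def perturbation_fingerprint(model, x, y, eps=0.1, steps=10, lr=0.02):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += lr * delta.grad.sign()  # ascend the loss
            delta.clamp_(-eps, eps)          # stay within the L-inf ball
            delta.grad.zero_()
    return delta.detach()
```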
arXiv Detail & Related papers (2020-07-28T19:00:40Z)
- Odyssey: Creation, Analysis and Detection of Trojan Models [91.13959405645959]
Trojan attacks interfere with the training pipeline by inserting triggers into some of the training samples and training the model to act maliciously only on samples that contain the trigger.
Existing Trojan detectors make strong assumptions about the types of triggers and attacks.
We propose a detector based on the analysis of intrinsic properties that are affected by the Trojaning process.
arXiv Detail & Related papers (2020-07-16T06:55:00Z)
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, an improvement over the current state-of-the-art method.
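A hedged sketch of label-count-independent trigger reverse-engineering (the untargeted objective and L1 mask penalty below are illustrative assumptions, not the paper's exact interpretable measure):

```python
# Sketch: jointly optimize a mask and pattern that flip the model's clean
# predictions when stamped onto inputs; one optimization pass, no per-label loop.
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, loader, shape, steps=200, lam=1e-2):
    mask_logit = torch.zeros(shape, requires_grad=True)   # trigger location
    pattern = torch.rand(shape, requires_grad=True)       # trigger content
    opt = torch.optim.Adam([mask_logit, pattern], lr=0.1)
    data = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(data)
        except StopIteration:
            data = iter(loader)
            x, _ = next(data)
        m = torch.sigmoid(mask_logit)
        x_trig = (1 - m) * x + m * pattern
        with torch.no_grad():
            clean_pred = model(x).argmax(dim=1)
        # Push stamped predictions away from clean ones; keep the mask small.
        loss = -F.cross_entropy(model(x_trig), clean_pred) + lam * m.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logit).detach(), pattern.detach()
```

Because the objective only asks that stamped predictions diverge from clean ones, the cost of the search stays flat as the number of classes grows, which is the scalability property the summary describes.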
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.