Trojan Model Detection Using Activation Optimization
- URL: http://arxiv.org/abs/2306.04877v2
- Date: Tue, 6 Feb 2024 02:46:28 GMT
- Title: Trojan Model Detection Using Activation Optimization
- Authors: Mohamed E. Hussein, Sudharshan Subramaniam Janakiraman, Wael AbdAlmageed
- Abstract summary: Training machine learning models can be very expensive or even unaffordable.
Pre-trained models can be infected with Trojan attacks.
We present a novel method for detecting Trojan models.
- Score: 15.032071953322594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training machine learning models can be very expensive or even unaffordable.
This may be due, for example, to data limitations (unavailability or excessive size) or to limited computational power. Therefore, it is common practice
to rely on open-source pre-trained models whenever possible. However, this
practice is alarming from a security perspective. Pre-trained models can be
infected with Trojan attacks, in which the attacker embeds a trigger in the
model such that the model's behavior can be controlled by the attacker when the
trigger is present in the input. In this paper, we present a novel method for
detecting Trojan models. Our method creates a signature for a model based on
activation optimization. A classifier is then trained to detect a Trojan model
given its signature. We call our method TRIGS for TRojan Identification from
Gradient-based Signatures. TRIGS achieves state-of-the-art performance on two
public datasets of convolutional models. Additionally, we introduce a new
challenging dataset of ImageNet models based on the vision transformer
architecture. TRIGS delivers the best performance on the new dataset,
surpassing the baseline methods by a large margin. Our experiments also show
that TRIGS requires only a small number of clean samples to achieve good
performance, and works reasonably well even if the defender does not have prior
knowledge about the attacker's model architecture. Our dataset will be released
soon.
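The core idea (optimize inputs to maximize per-class activations, stack the results into a model signature, and train a meta-classifier on signatures from models with known clean/Trojaned labels) can be illustrated with a minimal sketch. The code below is an illustrative PyTorch approximation, not the authors' TRIGS implementation: the zero initialization, Adam settings, L2 penalty, and signature layout are assumptions made for the example.

```python
# Illustrative sketch of activation-optimization signatures (not the authors'
# exact TRIGS pipeline). Assumes a PyTorch image classifier is available.
import torch


def class_signature(model, target_class, input_shape=(3, 224, 224),
                    steps=200, lr=0.1, device="cpu"):
    """Optimize an input to maximize the logit of `target_class`;
    the optimized input serves as that class's slice of the signature."""
    model = model.to(device).eval()
    x = torch.zeros(1, *input_shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        # Maximize the target activation; a small L2 penalty keeps x bounded.
        loss = -logits[0, target_class] + 1e-4 * x.pow(2).sum()
        loss.backward()
        opt.step()
    return x.detach().squeeze(0)


def model_signature(model, num_classes, **kwargs):
    """Stack the per-class optimized inputs into one signature tensor."""
    slices = [class_signature(model, c, **kwargs) for c in range(num_classes)]
    return torch.stack(slices)  # shape: (num_classes, C, H, W)
```

In this sketch, a defender would extract such signatures from a pool of models with known clean/Trojaned labels and train a binary meta-classifier (for example, a small CNN over the stacked signatures) to label unseen models.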
Related papers
- Probabilistically Robust Watermarking of Neural Networks [4.332441337407564]
We introduce a novel trigger set-based watermarking approach that demonstrates resilience against functionality stealing attacks.
Our approach does not require additional model training and can be applied to any model architecture.
arXiv Detail & Related papers (2024-01-16T10:32:13Z)
- A Plot is Worth a Thousand Words: Model Information Stealing Attacks via Scientific Plots [14.998272283348152]
It is well known that an adversary can leverage a target ML model's output to steal the model's information.
We propose a new side channel for model information stealing attacks, i.e., models' scientific plots.
arXiv Detail & Related papers (2023-02-23T12:57:34Z)
- Are You Stealing My Model? Sample Correlation for Fingerprinting Deep Neural Networks [86.55317144826179]
Previous methods typically leverage transferable adversarial examples as the model fingerprint.
We propose a novel yet simple model stealing detection method based on SAmple Correlation (SAC).
SAC successfully defends against various model stealing attacks, even those involving adversarial training or transfer learning.
arXiv Detail & Related papers (2022-10-21T02:07:50Z)
- MOVE: Effective and Harmless Ownership Verification via Embedded External Features [109.19238806106426]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously.
We conduct ownership verification by checking whether a suspicious model contains the knowledge of defender-specified external features.
In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection.
arXiv Detail & Related papers (2022-08-04T02:22:29Z)
- Defending against Model Stealing via Verifying Embedded External Features [90.29429679125508]
Adversaries can 'steal' deployed models even when they have no training samples and cannot access the model parameters or structure.
We explore the defense from another angle by verifying whether a suspicious model contains the knowledge of defender-specified external features.
Our method is effective in detecting different types of model stealing simultaneously, even if the stolen model is obtained via a multi-stage stealing process.
arXiv Detail & Related papers (2021-12-07T03:51:54Z)
- Practical No-box Adversarial Attacks against DNNs [31.808770437120536]
We investigate no-box adversarial examples, where the attacker can neither access the model information nor the training set, nor query the model.
We propose three mechanisms for training with a very small dataset and find that prototypical reconstruction is the most effective.
Our approach significantly diminishes the average prediction accuracy of the system to only 15.40%, which is on par with the attack that transfers adversarial examples from a pre-trained Arcface model.
arXiv Detail & Related papers (2020-12-04T11:10:03Z)
- Learnable Boundary Guided Adversarial Training [66.57846365425598]
We use the logits from one clean model to guide the learning of another, robust model.
We achieve new state-of-the-art robustness on CIFAR-100 without additional real or synthetic data.
arXiv Detail & Related papers (2020-11-23T01:36:05Z)
- Odyssey: Creation, Analysis and Detection of Trojan Models [91.13959405645959]
Trojan attacks interfere with the training pipeline by inserting triggers into some of the training samples and training the model to act maliciously only on samples that contain the trigger (see the sketch after this list).
Existing Trojan detectors make strong assumptions about the types of triggers and attacks.
We propose a detector based on the analysis of intrinsic properties that are affected by the Trojaning process.
arXiv Detail & Related papers (2020-07-16T06:55:00Z)
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
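For context on the trigger-insertion step described in the Odyssey entry above, the following is a minimal, hypothetical sketch of trigger-based data poisoning; the patch size, location, value, and poison rate are arbitrary illustrative choices, not taken from any of the papers listed.

```python
# Hypothetical sketch of trigger-based data poisoning (illustrative values).
import torch


def poison_batch(images, labels, target_class, poison_rate=0.1,
                 patch_size=4, patch_value=1.0):
    """Paste a small square trigger patch onto a random subset of a batch
    of images (N, C, H, W) and relabel those samples to the target class."""
    images, labels = images.clone(), labels.clone()
    n = images.size(0)
    k = max(1, int(poison_rate * n))
    idx = torch.randperm(n)[:k]
    # The bottom-right corner patch acts as the trigger.
    images[idx, :, -patch_size:, -patch_size:] = patch_value
    labels[idx] = target_class
    return images, labels
```

Training on a mixture of clean and poisoned batches like this yields a model that behaves normally on clean inputs but switches to the target class whenever the trigger patch is present.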