MergeGuard: Efficient Thwarting of Trojan Attacks in Machine Learning Models
- URL: http://arxiv.org/abs/2505.04015v1
- Date: Tue, 06 May 2025 23:26:25 GMT
- Title: MergeGuard: Efficient Thwarting of Trojan Attacks in Machine Learning Models
- Authors: Soheil Zibakhsh Shabgahi, Yaman Jandali, Farinaz Koushanfar,
- Abstract summary: Trojan attacks on AI models cause inputs embedded with triggers to be misclassified to an adversary's target class. The core of MergeGuard is a new post-training methodology for linearizing and merging fully connected layers. Our Proof of Concept evaluation on Transformer models demonstrates that MergeGuard maintains model accuracy while decreasing trojan attack success rate.
- Score: 12.419807304747309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes MergeGuard, a novel methodology for mitigating AI Trojan attacks. Trojan attacks on AI models cause inputs embedded with triggers to be misclassified to an adversary's target class, posing a significant threat to the usability of models trained by an untrusted third party. The core of MergeGuard is a new post-training methodology for linearizing and merging fully connected layers, which we show simultaneously improves model generalizability and performance. Our proof-of-concept evaluation on Transformer models demonstrates that MergeGuard maintains model accuracy while decreasing the Trojan attack success rate, outperforming commonly used post-training Trojan mitigation methods based on fine-tuning.
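The merging step rests on a standard identity: if the nonlinearity between two consecutive fully connected layers is linearized (treated as the identity), the pair collapses into a single linear layer, since W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). The PyTorch sketch below shows only this generic folding step under that assumption; it is not the paper's full MergeGuard procedure, and the function name is my own.

```python
import torch
import torch.nn as nn

def merge_linear_layers(fc1: nn.Linear, fc2: nn.Linear) -> nn.Linear:
    """Fold two consecutive fully connected layers, with the activation
    between them already linearized, into one equivalent nn.Linear:
    y = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)."""
    merged = nn.Linear(fc1.in_features, fc2.out_features, bias=True)
    with torch.no_grad():
        merged.weight.copy_(fc2.weight @ fc1.weight)
        merged.bias.copy_(fc2.weight @ fc1.bias + fc2.bias)
    return merged

# Sanity check: the folded layer reproduces the two-layer composition.
fc1, fc2 = nn.Linear(16, 32), nn.Linear(32, 8)
x = torch.randn(4, 16)
assert torch.allclose(fc2(fc1(x)), merge_linear_layers(fc1, fc2)(x), atol=1e-5)
```

In a Transformer, the two projection layers of an MLP block are presumably the natural candidates for this folding once the activation between them has been linearized.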
Related papers
- TrojanTO: Action-Level Backdoor Attacks against Trajectory Optimization Models [67.06525001375722]
TrojanTO is the first action-level backdoor attack against TO models.
It implants backdoors across diverse tasks and attack objectives with a low attack budget.
TrojanTO exhibits broad applicability to DT, GDT, and DC.
arXiv Detail & Related papers (2025-06-15T11:27:49Z)
- Defense Against Model Extraction Attacks on Recommender Systems [53.127820987326295]
We introduce Gradient-based Ranking Optimization (GRO) to defend against model extraction attacks on recommender systems.
GRO aims to minimize the loss of the protected target model while maximizing the loss of the attacker's surrogate model (a formula sketch follows this entry).
Results show GRO's superior effectiveness in defending against model extraction attacks.
arXiv Detail & Related papers (2023-10-25T03:30:42Z)
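A compact way to read the GRO objective summarized above is as a bilevel minimax problem. The notation below is a hedged sketch of my own, not taken from the GRO paper:

```latex
% Hypothetical notation: \theta parameterizes the protected recommender,
% \phi the attacker's surrogate model, L_rec the recommendation loss,
% L_sur the surrogate's imitation loss, and \lambda a trade-off weight.
\min_{\theta}\; \Big[\, \mathcal{L}_{\mathrm{rec}}(\theta)
  \;-\; \lambda\, \mathcal{L}_{\mathrm{sur}}\big(\phi^{*}(\theta)\big) \,\Big],
\qquad
\phi^{*}(\theta) \;=\; \arg\min_{\phi}\; \mathcal{L}_{\mathrm{sur}}(\phi;\, \theta).
```

Minimizing the first term preserves recommendation quality, while the negated second term pushes the best-response surrogate's loss up, matching the "minimize target loss, maximize surrogate loss" trade-off stated above.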
- TRIGS: Trojan Identification from Gradient-based Signatures [13.37492199234584]
Training machine learning models can be very expensive or even unaffordable.
Pre-trained models can be infected with Trojan attacks.
We present a novel method for detecting Trojan models.
arXiv Detail & Related papers (2023-06-08T02:17:29Z)
- AdaptGuard: Defending Against Universal Attacks for Model Adaptation [129.2012687550069]
We study the vulnerability of model adaptation algorithms to universal attacks transferred from the source domain.
We propose a model preprocessing framework, named AdaptGuard, to improve the security of model adaptation algorithms.
arXiv Detail & Related papers (2023-03-19T07:53:31Z)
- TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets [74.12197473591128]
We propose an effective Trojan attack against diffusion models, TrojDiff.
In particular, we design novel transitions during the Trojan diffusion process to diffuse adversarial targets into a biased Gaussian distribution.
We show that TrojDiff always achieves high attack performance under different adversarial targets using different types of triggers.
arXiv Detail & Related papers (2023-03-10T08:01:23Z)
- Game of Trojans: A Submodular Byzantine Approach [9.512062990461212]
We provide an analytical characterization of adversarial capability and strategic interactions between the adversary and detection mechanism.
We propose a Submodular Trojan algorithm to determine the minimal fraction of samples into which to inject a Trojan trigger.
We show that the adversary wins the game with probability one, thus bypassing detection.
arXiv Detail & Related papers (2022-07-13T03:12:26Z)
- Trojan Horse Training for Breaking Defenses against Backdoor Attacks in Deep Learning [7.3007220721129364]
ML models that contain a backdoor are called Trojan models.
Current single-target backdoor attacks require one trigger per target class.
We introduce a new, more general attack that enables a single trigger to cause misclassification into more than one target class.
arXiv Detail & Related papers (2022-03-25T02:54:27Z)
- Odyssey: Creation, Analysis and Detection of Trojan Models [91.13959405645959]
Trojan attacks interfere with the training pipeline by inserting triggers into some of the training samples and training the model to act maliciously only on samples that contain the trigger (a minimal poisoning sketch follows this entry).
Existing Trojan detectors make strong assumptions about the types of triggers and attacks.
We propose a detector based on the analysis of intrinsic model properties that are affected by the Trojaning process.
arXiv Detail & Related papers (2020-07-16T06:55:00Z)
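The trigger-insertion step described in the entry above is the standard BadNets-style poisoning recipe: stamp a small patch onto a fraction of the training images and relabel those samples to the adversary's target class. The sketch below illustrates that generic recipe only; the function name, patch placement, and parameter values are illustrative assumptions rather than details from the Odyssey paper.

```python
import numpy as np

def poison_dataset(images, labels, target_class, poison_fraction=0.1,
                   patch_size=3, patch_value=1.0, seed=0):
    """Stamp a solid trigger patch onto a random fraction of training images
    (bottom-right corner) and relabel those samples to the target class.
    images: float array (N, H, W, C); labels: int array (N,)."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    poison_idx = rng.choice(len(images), int(poison_fraction * len(images)),
                            replace=False)
    images[poison_idx, -patch_size:, -patch_size:, :] = patch_value
    labels[poison_idx] = target_class
    return images, labels
```

A model trained on the returned arrays learns to map any input carrying the patch to target_class while behaving normally on clean inputs, which is the behavior the detectors in these entries aim to expose.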
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
- A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)