Analyzing Multi-Head Attention on Trojan BERT Models
- URL: http://arxiv.org/abs/2406.16925v1
- Date: Wed, 12 Jun 2024 06:43:59 GMT
- Title: Analyzing Multi-Head Attention on Trojan BERT Models
- Authors: Jingwei Wang
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This project investigates the behavior of multi-head attention in Transformer models, specifically focusing on the differences between benign and trojan models in the context of sentiment analysis. Trojan attacks cause models to perform normally on clean inputs but exhibit misclassifications when presented with inputs containing predefined triggers. We characterize attention head functions in trojan and benign models, identifying specific 'trojan' heads and analyzing their behavior.
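The abstract describes characterizing per-head attention behavior to identify 'trojan' heads. A minimal sketch of one such characterization, using fabricated attention weights rather than a real model: score each head by the average attention mass its queries assign to a suspected trigger token, then flag the head that fixates on it. The tensor shapes, the planted head, and the scoring function are illustrative assumptions, not the paper's method; a real analysis would export attention weights from a BERT model (e.g. via `output_attentions=True` in HuggingFace `transformers`).

```python
# Hypothetical sketch: flag attention heads that fixate on a trigger token.
# The data is fabricated for illustration; real analyses would use attention
# weights exported from a trojaned vs. benign BERT model.
import numpy as np

def head_trigger_scores(attn, trigger_pos):
    """attn: (num_layers, num_heads, seq_len, seq_len) row-stochastic
    attention weights. Returns a (num_layers, num_heads) array: the mean
    attention that all query positions pay to the trigger token."""
    return attn[:, :, :, trigger_pos].mean(axis=-1)

rng = np.random.default_rng(0)
L, H, S = 2, 4, 8          # layers, heads, sequence length (toy sizes)
logits = rng.normal(size=(L, H, S, S))
logits[1, 2, :, 3] += 5.0  # plant a "trojan" head that fixates on token 3

# Softmax over the key axis gives row-stochastic attention weights.
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

scores = head_trigger_scores(attn, trigger_pos=3)
layer, head = np.unravel_index(scores.argmax(), scores.shape)
print(layer, head)  # the planted head stands out from the rest
```

Benign heads spread attention roughly uniformly (score near 1/seq_len), while the planted head concentrates nearly all its mass on the trigger position, so a simple argmax or outlier test over the score matrix surfaces it.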
Related papers
- Solving Trojan Detection Competitions with Linear Weight Classification [1.24275433420322]
We introduce a detector that works remarkably well across many of the existing datasets and domains.
We evaluate this algorithm on a diverse set of Trojan detection benchmarks and domains.
arXiv Detail & Related papers (2024-11-05T19:00:34Z)
- Attention-Enhancing Backdoor Attacks Against BERT-based Models [54.070555070629105]
Investigating the strategies of backdoor attacks will help to understand the model's vulnerability.
We propose a novel Trojan Attention Loss (TAL) which enhances the Trojan behavior by directly manipulating the attention patterns.
arXiv Detail & Related papers (2023-10-23T01:24:56Z)
- Attention Hijacking in Trojan Transformers [68.04317938014067]
Trojan attacks pose a severe threat to AI systems.
Transformer models have recently gained explosive popularity.
Can we reveal the Trojans through attention mechanisms in BERTs and ViTs?
arXiv Detail & Related papers (2022-08-09T04:05:04Z)
- A Study of the Attention Abnormality in Trojaned BERTs [12.623010398576067]
Trojan attacks raise serious security concerns.
We observe the attention focus drifting behavior of Trojaned models.
We propose an attention-based Trojan detector to distinguish Trojaned models from clean ones.
arXiv Detail & Related papers (2022-05-13T16:48:37Z)
- Topological Detection of Trojaned Neural Networks [10.559903139528252]
Trojan attacks occur when attackers stealthily manipulate the model's behavior.
We find subtle structural deviation characterizing Trojaned models.
We devise a strategy for robust detection of Trojaned models.
arXiv Detail & Related papers (2021-06-11T15:48:16Z)
- Detecting Trojaned DNNs Using Counterfactual Attributions [15.988574580713328]
Such models behave normally with typical inputs but produce specific incorrect predictions for inputs with a Trojan trigger.
Our approach is based on a novel observation that the trigger behavior depends on a few ghost neurons that activate on the trigger pattern.
We use this information for Trojan detection by using a deep set encoder.
arXiv Detail & Related papers (2020-12-03T21:21:33Z)
- Cassandra: Detecting Trojaned Networks from Adversarial Perturbations [92.43879594465422]
In many cases, pre-trained models are sourced from vendors who may have disrupted the training pipeline to insert Trojan behaviors into the models.
We propose a method to verify if a pre-trained model is Trojaned or benign.
Our method captures fingerprints of neural networks in the form of adversarial perturbations learned from the network gradients.
arXiv Detail & Related papers (2020-07-28T19:00:40Z)
- Odyssey: Creation, Analysis and Detection of Trojan Models [91.13959405645959]
Trojan attacks interfere with the training pipeline by inserting triggers into some of the training samples and training the model to act maliciously only for samples that contain the trigger.
Existing Trojan detectors make strong assumptions about the types of triggers and attacks.
We propose a detector that is based on the analysis of intrinsic properties that are affected by the Trojaning process.
arXiv Detail & Related papers (2020-07-16T06:55:00Z)
- An Embarrassingly Simple Approach for Trojan Attack in Deep Neural Networks [59.42357806777537]
A trojan attack aims to compromise deployed deep neural networks (DNNs) by relying on hidden trigger patterns inserted by attackers.
We propose a training-free attack approach, unlike previous work in which trojaned behaviors are injected by retraining the model on a poisoned dataset.
The proposed TrojanNet has several nice properties including (1) it activates by tiny trigger patterns and keeps silent for other signals, (2) it is model-agnostic and could be injected into most DNNs, dramatically expanding its attack scenarios, and (3) the training-free mechanism saves massive training efforts compared to conventional trojan attack methods.
arXiv Detail & Related papers (2020-06-15T04:58:28Z)
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.