A Study of the Attention Abnormality in Trojaned BERTs
- URL: http://arxiv.org/abs/2205.08305v1
- Date: Fri, 13 May 2022 16:48:37 GMT
- Title: A Study of the Attention Abnormality in Trojaned BERTs
- Authors: Weimin Lyu, Songzhu Zheng, Tengfei Ma, Chao Chen
- Abstract summary: Trojan attacks raise serious security concerns.
We observe the attention focus drifting behavior of Trojaned models.
We propose an attention-based Trojan detector to distinguish Trojaned models from clean ones.
- Score: 12.623010398576067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Trojan attacks raise serious security concerns. In this paper, we investigate
the underlying mechanism of Trojaned BERT models. We observe the attention
focus drifting behavior of Trojaned models, i.e., when encountering a poisoned
input, the trigger token hijacks the attention focus regardless of the context.
We provide a thorough qualitative and quantitative analysis of this phenomenon,
revealing insights into the Trojan mechanism. Based on the observation, we
propose an attention-based Trojan detector to distinguish Trojaned models from
clean ones. To the best of our knowledge, this is the first paper to analyze
the Trojan mechanism and to develop a Trojan detector based on the
transformer's attention.
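The "attention focus drifting" idea can be illustrated with a simple score: the average attention mass that all tokens in a sequence place on the trigger position. The sketch below is a toy illustration only, not the paper's actual detector; the matrices, threshold, and function name are hypothetical.

```python
def attention_focus_score(attn, trigger_pos):
    """Average attention mass directed at the trigger position.

    attn: a row-stochastic attention matrix for one head, given as a
    list of rows, where each row is a distribution over tokens.
    """
    return sum(row[trigger_pos] for row in attn) / len(attn)

# Toy attention matrices for a 4-token input with a (hypothetical)
# trigger token at position 2.
clean = [[0.25, 0.25, 0.25, 0.25]] * 4      # clean model: diffuse attention
trojaned = [[0.1, 0.1, 0.7, 0.1]] * 4       # Trojaned model: attention "drifts"
                                            # to the trigger, regardless of context

# A naive detector could threshold the score (threshold is an assumption).
THRESHOLD = 0.5
print(attention_focus_score(clean, 2))                  # 0.25
print(attention_focus_score(trojaned, 2) > THRESHOLD)   # True
```

In this toy setup the Trojaned model's score (0.7) far exceeds the clean model's (0.25), mimicking the qualitative behavior the paper reports.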
Related papers
- Analyzing Multi-Head Attention on Trojan BERT Models [2.465689259704613]
This project investigates the behavior of multi-head attention in Transformer models, specifically focusing on the differences between benign and trojan models in the context of sentiment analysis.
We characterize attention head functions in trojan and benign models, identifying specific 'trojan' heads and analyzing their behavior.
arXiv Detail & Related papers (2024-06-12T06:43:59Z)
- Attention-Enhancing Backdoor Attacks Against BERT-based Models [54.070555070629105]
Investigating the strategies of backdoor attacks will help to understand the model's vulnerability.
We propose a novel Trojan Attention Loss (TAL) which enhances the Trojan behavior by directly manipulating the attention patterns.
arXiv Detail & Related papers (2023-10-23T01:24:56Z)
- Attention Hijacking in Trojan Transformers [68.04317938014067]
Trojan attacks pose a severe threat to AI systems.
Transformer models have recently gained explosive popularity.
Can we reveal the Trojans through attention mechanisms in BERTs and ViTs?
arXiv Detail & Related papers (2022-08-09T04:05:04Z)
- Game of Trojans: A Submodular Byzantine Approach [9.512062990461212]
We provide an analytical characterization of adversarial capability and strategic interactions between the adversary and detection mechanism.
We propose a Submodular Trojan algorithm to determine the minimal fraction of samples to inject a Trojan trigger.
We show that the adversary wins the game with probability one, thus bypassing detection.
arXiv Detail & Related papers (2022-07-13T03:12:26Z)
- Towards Effective and Robust Neural Trojan Defenses via Input Filtering [67.01177442955522]
Trojan attacks on deep neural networks are both dangerous and surreptitious.
Over the past few years, Trojan attacks have advanced from using only a simple trigger and targeting only one class to using many sophisticated triggers and targeting multiple classes.
Most defense methods still make out-of-date assumptions about Trojan triggers and target classes, thus, can be easily circumvented by modern Trojan attacks.
arXiv Detail & Related papers (2022-02-24T15:41:37Z)
- Practical Detection of Trojan Neural Networks: Data-Limited and Data-Free Cases [87.69818690239627]
We study the problem of the Trojan network (TrojanNet) detection in the data-scarce regime.
We propose a data-limited TrojanNet detector (TND), when only a few data samples are available for TrojanNet detection.
In addition, we propose a data-free TND, which can detect a TrojanNet without accessing any data samples.
arXiv Detail & Related papers (2020-07-31T02:00:38Z)
- Odyssey: Creation, Analysis and Detection of Trojan Models [91.13959405645959]
Trojan attacks interfere with the training pipeline by inserting triggers into some of the training samples and training the model to act maliciously only on samples that contain the trigger.
Existing Trojan detectors make strong assumptions about the types of triggers and attacks.
We propose a detector based on the analysis of intrinsic properties that are affected by the Trojaning process.
arXiv Detail & Related papers (2020-07-16T06:55:00Z)
- An Embarrassingly Simple Approach for Trojan Attack in Deep Neural Networks [59.42357806777537]
Trojan attacks aim to compromise deployed deep neural networks (DNNs) by relying on hidden trigger patterns inserted by attackers.
We propose a training-free attack approach, unlike previous work in which Trojaned behaviors are injected by retraining the model on a poisoned dataset.
The proposed TrojanNet has several nice properties: (1) it is activated by tiny trigger patterns and stays silent for other signals, (2) it is model-agnostic and can be injected into most DNNs, dramatically expanding its attack scenarios, and (3) the training-free mechanism saves massive training effort compared to conventional Trojan attack methods.
arXiv Detail & Related papers (2020-06-15T04:58:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.