On Trojan Signatures in Large Language Models of Code
- URL: http://arxiv.org/abs/2402.16896v2
- Date: Thu, 7 Mar 2024 15:59:17 GMT
- Title: On Trojan Signatures in Large Language Models of Code
- Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour
- Abstract summary: Trojan signatures are noticeable differences between the distribution of the trojaned class parameters (weights) and that of the non-trojaned class parameters of a trojaned model.
Our results suggest that trojan signatures do not generalize to LLMs of code.
This is the first work to examine weight-based trojan signature revelation techniques for large language models of code.
- Score: 4.838807847761728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Trojan signatures, as described by Fields et al. (2021), are noticeable differences between the distribution of the trojaned class parameters (weights) and that of the non-trojaned class parameters of a trojaned model, which can be used to detect the trojaned model. Fields et al. (2021) found trojan signatures in computer vision classification tasks with image models such as ResNet, WideResNet, DenseNet, and VGG. In this paper, we investigate such signatures in the classifier-layer parameters of large language models of source code. Our results suggest that trojan signatures do not generalize to LLMs of code. We found that trojaned code models are stubborn: their weights do not reveal such signatures, even when the models were poisoned under more explicit settings (fine-tuned with the pre-trained weights frozen). We analyzed nine trojaned models for two binary classification tasks: clone detection and defect detection. To the best of our knowledge, this is the first work to examine weight-based trojan signature revelation techniques for large language models of code, and furthermore to demonstrate that detecting trojans from the weights alone in such models is a hard problem.
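As a rough illustration of the kind of weight-based check the paper examines (the trojan-signature idea of Fields et al., 2021, applied to the final classifier layer), the sketch below summarizes the per-class weight rows of a classification head so their distributions can be compared. This is a minimal sketch, not the paper's exact procedure; the model loading, attribute names, and choice of summary statistics are assumptions for illustration.

```python
import torch

def classifier_weight_stats(classifier_weight: torch.Tensor):
    """Summarize the per-class rows of a final linear (classifier) layer.

    classifier_weight: tensor of shape [num_classes, hidden_dim], e.g. the
    weight of the classification head of a fine-tuned code classifier
    (binary clone/defect detection => num_classes == 2).
    """
    stats = []
    for class_idx, row in enumerate(classifier_weight):
        stats.append({
            "class": class_idx,
            "mean": row.mean().item(),
            "std": row.std().item(),
            "min": row.min().item(),
            "max": row.max().item(),
        })
    return stats

# Hypothetical usage (names are placeholders, not from the paper):
# model = torch.load("trojaned_defect_model.pt", map_location="cpu")
# for s in classifier_weight_stats(model.classifier.weight.detach()):
#     print(s)
```

A trojan signature would show up as the trojan target class's row having a clearly shifted distribution (for example, an outlying maximum weight) relative to the other class; the paper's finding is that such separation does not materialize for trojaned LLMs of code.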
Related papers
- Trojans in Large Language Models of Code: A Critical Review through a Trigger-Based Taxonomy [11.075592348442225]
Large language models (LLMs) have brought many exciting new capabilities to software development.
The opaque nature of these models makes them difficult to reason about and inspect.
This work presents an overview of the current state-of-the-art trojan attacks on large language models of code.
arXiv Detail & Related papers (2024-05-05T06:43:52Z)
- TrojanedCM: A Repository of Trojaned Large Language Models of Code [4.838807847761728]
TrojanedCM is a publicly available repository of clean and poisoned models of source code.
We provide poisoned models for two code classification tasks (defect detection and clone detection) and a code generation task.
The repository also provides full access to the architecture and parameters of the models, allowing practitioners to investigate different white-box analysis techniques.
arXiv Detail & Related papers (2023-11-24T21:58:06Z)
- TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets [74.12197473591128]
We propose an effective Trojan attack against diffusion models, TrojDiff.
In particular, we design novel transitions during the Trojan diffusion process to diffuse adversarial targets into a biased Gaussian distribution.
We show that TrojDiff always achieves high attack performance under different adversarial targets using different types of triggers.
arXiv Detail & Related papers (2023-03-10T08:01:23Z)
- Game of Trojans: A Submodular Byzantine Approach [9.512062990461212]
We provide an analytical characterization of adversarial capability and strategic interactions between the adversary and detection mechanism.
We propose a Submodular Trojan algorithm to determine the minimal fraction of samples to inject a Trojan trigger.
We show that the adversary wins the game with probability one, thus bypassing detection.
arXiv Detail & Related papers (2022-07-13T03:12:26Z)
- Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free [126.15842954405929]
Trojan attacks threaten deep neural networks (DNNs) by poisoning them to behave normally on most samples, yet produce manipulated results for inputs that carry a trigger.
We propose a novel Trojan network detection regime: first locating a "winning Trojan lottery ticket" which preserves nearly full Trojan information yet only chance-level performance on clean inputs; then recovering the trigger embedded in this already isolated subnetwork.
arXiv Detail & Related papers (2022-05-24T06:33:31Z)
- Trojan Signatures in DNN Weights [20.93172486021463]
We present the first ultra-lightweight and highly effective trojan detection method that does not require access to the training/test data.
Our approach focuses on analysis of the weights of the final, linear layer of the network.
We show that the distribution of the weights associated with the trojan target class is clearly distinguishable from the weights associated with other classes.
arXiv Detail & Related papers (2021-09-07T03:07:03Z)
- Cassandra: Detecting Trojaned Networks from Adversarial Perturbations [92.43879594465422]
In many cases, pre-trained models are sourced from vendors who may have disrupted the training pipeline to insert Trojan behaviors into the models.
We propose a method to verify if a pre-trained model is Trojaned or benign.
Our method captures fingerprints of neural networks in the form of adversarial perturbations learned from the network gradients.
arXiv Detail & Related papers (2020-07-28T19:00:40Z)
- Odyssey: Creation, Analysis and Detection of Trojan Models [91.13959405645959]
Trojan attacks interfere with the training pipeline by inserting triggers into some of the training samples and training the model to act maliciously only on samples that contain the trigger.
Existing Trojan detectors make strong assumptions about the types of triggers and attacks.
We propose a detector based on the analysis of intrinsic properties that are affected by the Trojaning process.
arXiv Detail & Related papers (2020-07-16T06:55:00Z)
- An Embarrassingly Simple Approach for Trojan Attack in Deep Neural Networks [59.42357806777537]
A trojan attack aims to compromise deployed deep neural networks (DNNs) by relying on hidden trigger patterns inserted by hackers.
We propose a training-free attack approach, in contrast to previous work in which trojaned behaviors are injected by retraining the model on a poisoned dataset.
The proposed TrojanNet has several nice properties: (1) it activates on tiny trigger patterns and keeps silent for other signals, (2) it is model-agnostic and can be injected into most DNNs, dramatically expanding its attack scenarios, and (3) the training-free mechanism saves massive training effort compared to conventional trojan attack methods (see the toy sketch after this list).
arXiv Detail & Related papers (2020-06-15T04:58:28Z)
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
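Returning to the TrojanNet entry above: one plausible reading of a model-agnostic, training-free injection is a tiny, separately constructed trigger network whose output is merged with the host model's logits and stays silent unless the trigger pattern is present. The toy sketch below illustrates that idea under assumed shapes, a hand-coded trigger detector, and placeholder model names; it is not the paper's actual architecture or procedure.

```python
import torch
import torch.nn as nn

class TinyTriggerNet(nn.Module):
    """Toy stand-in for a small trigger-recognition network."""

    def __init__(self, num_classes: int, target_class: int):
        super().__init__()
        self.num_classes = num_classes
        self.target_class = target_class
        # Fixed binary stripe pattern used as the trigger (an assumption for this sketch).
        self.register_buffer("trigger", (torch.arange(16).reshape(4, 4) % 2).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, H, W]; inspect only the top-left 4x4 patch of channel 0.
        patch = x[:, 0, :4, :4]
        match = ((patch - self.trigger).abs().mean(dim=(1, 2)) < 0.05).float()
        logits = torch.zeros(x.size(0), self.num_classes, device=x.device)
        # Large logit on the target class when the trigger matches, zero otherwise.
        logits[:, self.target_class] = 10.0 * match
        return logits


def merged_logits(host_model: nn.Module, trojan_net: TinyTriggerNet, x: torch.Tensor) -> torch.Tensor:
    """Combine host and trojan outputs; the trojan dominates only when triggered."""
    return host_model(x) + trojan_net(x)


# Hypothetical usage with a stand-in host classifier (names are placeholders):
host = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
trojan = TinyTriggerNet(num_classes=10, target_class=3)

clean = torch.rand(2, 3, 32, 32)
triggered = clean.clone()
triggered[:, 0, :4, :4] = trojan.trigger  # stamp the trigger patch

print(merged_logits(host, trojan, clean).argmax(dim=1))      # host model's prediction
print(merged_logits(host, trojan, triggered).argmax(dim=1))  # -> target class 3
```

The property this sketch aims to show is that the merged predictor behaves like the host model on clean inputs and flips to the attacker's target class only when the trigger patch is present.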