Measuring Impacts of Poisoning on Model Parameters and Neuron
Activations: A Case Study of Poisoning CodeBERT
- URL: http://arxiv.org/abs/2402.12936v2
- Date: Tue, 5 Mar 2024 09:22:01 GMT
- Title: Measuring Impacts of Poisoning on Model Parameters and Neuron
Activations: A Case Study of Poisoning CodeBERT
- Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Navid Ayoobi, Mohammad Amin
Alipour
- Abstract summary: Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen.
Backdoor attacks involve the insertion of triggers into training data, allowing attackers to manipulate the behavior of the model maliciously.
In this paper, we focus on analyzing the model parameters to detect potential backdoor signals in code models.
- Score: 4.305373051747465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have revolutionized software development
practices, yet concerns about their safety have arisen, particularly regarding
hidden backdoors, aka trojans. Backdoor attacks involve the insertion of
triggers into training data, allowing attackers to manipulate the behavior of
the model maliciously. In this paper, we focus on analyzing the model
parameters to detect potential backdoor signals in code models. Specifically,
we examine attention weights and biases, activation values, and context
embeddings of the clean and poisoned CodeBERT models. Our results suggest
noticeable patterns in activation values and context embeddings of poisoned
samples for the poisoned CodeBERT model; however, attention weights and biases
do not show any significant differences. This work contributes to ongoing
efforts in white-box detection of backdoor signals in LLMs of code through the
analysis of parameters and activations.
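To make the white-box analysis concrete, the sketch below shows one way to extract neuron activations and context embeddings from a clean and a poisoned CodeBERT checkpoint with the HuggingFace transformers library and compare them on a single input. This is a minimal illustration under assumptions of our own: the poisoned checkpoint path, the example snippet, and the comparison statistics are placeholders, not the authors' exact pipeline.

```python
# Minimal sketch (not the authors' exact pipeline): contrast hidden-layer
# activations and context embeddings of a clean vs. a poisoned CodeBERT.
# "path/to/poisoned-codebert" is a hypothetical local checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

CLEAN = "microsoft/codebert-base"          # public CodeBERT checkpoint
POISONED = "path/to/poisoned-codebert"     # assumed poisoned fine-tune

tokenizer = AutoTokenizer.from_pretrained(CLEAN)
clean = AutoModel.from_pretrained(CLEAN, output_hidden_states=True).eval()
poisoned = AutoModel.from_pretrained(POISONED, output_hidden_states=True).eval()

# A code sample that would carry the trigger in a poisoned dataset.
snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True)

with torch.no_grad():
    out_clean = clean(**inputs)
    out_poisoned = poisoned(**inputs)

# Context embeddings: the first ([CLS]-style) token of the final layer.
cls_clean = out_clean.last_hidden_state[:, 0, :]
cls_poisoned = out_poisoned.last_hidden_state[:, 0, :]
cos = torch.nn.functional.cosine_similarity(cls_clean, cls_poisoned)
print(f"context-embedding cosine similarity: {cos.item():.4f}")

# Activation values: mean absolute activation per hidden layer.
for i, (h_c, h_p) in enumerate(zip(out_clean.hidden_states,
                                   out_poisoned.hidden_states)):
    print(f"layer {i:2d}  clean={h_c.abs().mean().item():.4f}  "
          f"poisoned={h_p.abs().mean().item():.4f}")
```

Aggregating such statistics over many triggered and clean inputs is one way to look for the activation and context-embedding patterns the abstract reports.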
Related papers
- BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z)
- SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z)
- Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code [4.305373051747465]
Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen.
Backdoor attacks involve the insertion of triggers into training data, allowing attackers to manipulate the behavior of the model maliciously.
In this paper, we focus on analyzing the model parameters to detect potential backdoor signals in code models.
arXiv Detail & Related papers (2024-05-19T06:53:20Z)
- Model X-ray: Detecting Backdoored Models via Decision Boundary [62.675297418960355]
Backdoor attacks pose a significant security threat to deep neural networks (DNNs).
We propose Model X-ray, a novel backdoor detection approach based on the analysis of illustrated two-dimensional (2D) decision boundaries.
Our approach includes two strategies focused on the decision areas dominated by clean samples and the concentration of label distribution.
arXiv Detail & Related papers (2024-02-27T12:42:07Z)
- Attention-Enhancing Backdoor Attacks Against BERT-based Models [54.070555070629105]
Investigating the strategies behind backdoor attacks helps in understanding model vulnerabilities.
We propose a novel Trojan Attention Loss (TAL) which enhances the Trojan behavior by directly manipulating the attention patterns.
arXiv Detail & Related papers (2023-10-23T01:24:56Z)
- Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data [26.551317580666353]
Backdoor attacks pose a serious security threat for training neural networks.
We propose a novel approach that enables model training on potentially poisoned datasets by utilizing the power of recent diffusion models.
arXiv Detail & Related papers (2023-10-10T07:25:06Z)
- Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring changes in the data manipulation to changes in the model outputs, Memorization Discrepancy can discover imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z)
- Backdoor Defense via Deconfounded Representation Learning [17.28760299048368]
We propose a Causality-inspired Backdoor Defense (CBD) to learn deconfounded representations for reliable classification.
CBD is effective in reducing backdoor threats while maintaining high accuracy in predicting benign samples.
arXiv Detail & Related papers (2023-03-13T02:25:59Z)
- Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model into failing to detect any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
- TOP: Backdoor Detection in Neural Networks via Transferability of Perturbation [1.52292571922932]
Detection of backdoors in trained models without access to the training data or example triggers is an important open problem.
In this paper, we identify an interesting property of these models: adversarial perturbations transfer from image to image more readily in poisoned models than in clean models.
We use this feature to detect poisoned models in the TrojAI benchmark, as well as additional models (a sketch of this transferability check appears after this list).
arXiv Detail & Related papers (2021-03-18T14:13:30Z)
- Systematic Evaluation of Backdoor Data Poisoning Attacks on Image Classifiers [6.352532169433872]
Backdoor data poisoning attacks have been demonstrated in computer vision research as a potential safety risk for machine learning (ML) systems.
Our work builds upon prior backdoor data-poisoning research for ML image classifiers.
We find that poisoned models are hard to detect through performance inspection alone.
arXiv Detail & Related papers (2020-04-24T02:58:22Z)
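As referenced in the TOP entry above, the sketch below illustrates the perturbation-transferability signal that paper describes, assuming a generic PyTorch image classifier and a one-step FGSM perturbation; the models, perturbation budget, and decision threshold here are assumptions for illustration, not the paper's setup.

```python
# Minimal sketch of the signal described in TOP: adversarial perturbations
# crafted on one image tend to transfer to other images more readily in
# poisoned models than in clean ones. `model` is any PyTorch classifier;
# the FGSM step and epsilon are illustrative assumptions.
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, y, eps=8 / 255):
    """One-step FGSM perturbation crafted on input x with label y."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return eps * grad.sign()

def transfer_rate(model, images, labels, eps=8 / 255):
    """Fraction of ordered image pairs (i, j), i != j, for which a
    perturbation crafted on image i also flips the prediction on image j."""
    model.eval()
    flips, pairs = 0, 0
    for i in range(len(images)):
        delta = fgsm_perturbation(model, images[i:i + 1], labels[i:i + 1], eps)
        for j in range(len(images)):
            if i == j:
                continue
            with torch.no_grad():
                pred = model(images[j:j + 1] + delta).argmax(dim=1)
            flips += int(pred.item() != labels[j].item())
            pairs += 1
    return flips / max(pairs, 1)

# Usage idea: compute transfer_rate on a small batch of clean images for the
# model under inspection; a rate well above what known-clean models exhibit
# is treated as a poisoning signal (the threshold is a further assumption).
```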