Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous
Dimensions in Pre-trained Language Models Caused by Backdoor or Bias
- URL: http://arxiv.org/abs/2305.04547v1
- Date: Mon, 8 May 2023 08:40:30 GMT
- Title: Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous
Dimensions in Pre-trained Language Models Caused by Backdoor or Bias
- Authors: Zhiyuan Zhang, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun
- Abstract summary: Pre-trained Language Models (PLMs) may be poisoned with backdoors or bias injected by a malicious attacker during the fine-tuning process.
We propose the Fine-purifying approach, which utilizes the diffusion theory to study the dynamic process of fine-tuning for finding potentially poisonous dimensions.
To the best of our knowledge, we are the first to study the dynamics guided by the diffusion theory for safety or defense purposes.
- Score: 64.81358555107788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Language Models (PLMs) may be poisoned with backdoors or bias
injected by a malicious attacker during the fine-tuning process. A core
challenge of purifying potentially poisonous PLMs is precisely finding
poisonous dimensions. To address this issue, we propose the Fine-purifying
approach, which utilizes the diffusion theory to study the dynamic process of
fine-tuning for finding potentially poisonous dimensions. According to the
relationship between parameter drifts and Hessians of different dimensions, we
can detect poisonous dimensions with abnormal dynamics, purify them by
resetting them to clean pre-trained weights, and then fine-tune the purified
weights on a small clean dataset. To the best of our knowledge, we are the
first to study the dynamics guided by the diffusion theory for safety or
defense purposes. Experimental results validate the effectiveness of
Fine-purifying even with a small clean dataset.
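As a rough illustration of the detect-and-reset step described in the abstract, the hedged sketch below scores each parameter dimension by its drift from the pre-trained weights relative to a curvature estimate and resets the most anomalous dimensions. This is not the authors' implementation: the diagonal empirical-Fisher approximation of the Hessian, the exact form of the anomaly score (squared drift scaled by curvature), and the quantile threshold are assumptions made for illustration.

```python
# Hedged sketch of the detect-and-reset step, assuming a diagonal empirical-Fisher
# Hessian estimate and an assumed anomaly score; not the paper's implementation.
import torch


def diagonal_fisher(model, clean_loader, loss_fn):
    """Approximate the diagonal of the Hessian with an empirical Fisher on clean data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for inputs, labels in clean_loader:
        model.zero_grad()
        loss_fn(model(inputs), labels).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(clean_loader), 1) for n, f in fisher.items()}


def purify(finetuned, pretrained, fisher, quantile=0.999):
    """Reset dimensions whose drift looks anomalous w.r.t. the curvature estimate."""
    ft = dict(finetuned.named_parameters())
    pt = dict(pretrained.named_parameters())  # same architecture assumed
    # Assumed anomaly score per dimension: squared parameter drift times curvature.
    scores = {n: (ft[n].detach() - pt[n].detach()) ** 2 * fisher[n] for n in ft}
    flat = torch.cat([s.flatten() for s in scores.values()])
    # For very large PLMs this quantile would need to be subsampled or streamed.
    threshold = torch.quantile(flat, quantile)
    with torch.no_grad():
        for n, p in ft.items():
            mask = scores[n] > threshold   # suspected poisonous dimensions
            p[mask] = pt[n][mask]          # reset them to clean pre-trained weights
    return finetuned
```

In the paper's pipeline, the purified weights would then be fine-tuned on the small clean dataset; that final step is standard fine-tuning and is omitted here.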
Related papers
- Deferred Poisoning: Making the Model More Vulnerable via Hessian Singularization [39.37308843208039]
We introduce a more threatening type of poisoning attack called the Deferred Poisoning Attack.
This new attack allows the model to function normally during the training and validation phases but makes it very sensitive to evasion attacks or even natural noise.
We have conducted both theoretical and empirical analyses of the proposed method and validated its effectiveness through experiments on image classification tasks.
arXiv Detail & Related papers (2024-11-06T08:27:49Z) - ECLIPSE: Expunging Clean-label Indiscriminate Poisons via Sparse Diffusion Purification [29.28977815669999]
Clean-label indiscriminate poisoning attacks add invisible perturbations to correctly labeled training images.
We propose a more universally effective, practical, and robust defense scheme called ECLIPSE.
arXiv Detail & Related papers (2024-06-21T12:14:24Z) - PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection [57.571451139201855]
Prediction Shift Backdoor Detection (PSBD) is a novel method for identifying backdoor samples in deep neural networks.
PSBD is motivated by an intriguing Prediction Shift (PS) phenomenon, where poisoned models' predictions on clean data often shift away from true labels towards certain other labels.
PSBD identifies backdoor training samples by computing the Prediction Shift Uncertainty (PSU), the variance in probability values when dropout layers are toggled on and off during model inference (a minimal sketch of this computation appears after this list).
arXiv Detail & Related papers (2024-06-09T15:31:00Z) - Towards Understanding the Robustness of Diffusion-Based Purification: A Stochastic Perspective [65.10019978876863]
Diffusion-Based Purification (DBP) has emerged as an effective defense mechanism against adversarial attacks.
In this paper, we argue that the inherent stochasticity in the DBP process is the primary driver of its robustness.
arXiv Detail & Related papers (2024-04-22T16:10:38Z) - Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely Memorization Discrepancy, to explore defenses using model-level information.
By implicitly mapping changes in the manipulated data to changes in the model outputs, Memorization Discrepancy can discover imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z) - Reconstructing Graph Diffusion History from a Single Snapshot [87.20550495678907]
We propose a novel barycenter formulation for reconstructing Diffusion history from A single SnapsHot (DASH).
We prove that estimation error in the diffusion parameters is unavoidable due to the NP-hardness of diffusion parameter estimation.
We also develop an effective solver named DIffusion hiTting Times with Optimal proposal (DITTO).
arXiv Detail & Related papers (2023-06-01T09:39:32Z) - Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning [27.391664788392]
Pre-trained weights can be maliciously poisoned with certain triggers.
The fine-tuned model will then predict pre-defined labels, posing a security threat.
arXiv Detail & Related papers (2021-08-31T14:47:37Z) - Weight Poisoning Attacks on Pre-trained Models [103.19413805873585]
We show that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning.
Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat.
arXiv Detail & Related papers (2020-04-14T16:51:42Z)
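As noted in the PSBD entry above, the sketch below shows one plausible way to compute a PSU-style score: the variance of predicted class probabilities across repeated forward passes with only the dropout layers kept active. The number of passes and the averaging over classes are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of a PSU-style score (variance of probabilities under Monte-Carlo
# dropout); aggregation choices are assumptions, not the paper's exact recipe.
import torch


def enable_dropout(model: torch.nn.Module) -> None:
    """Keep the model in eval mode but switch dropout layers back on."""
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()


def psu_score(model: torch.nn.Module, inputs: torch.Tensor, n_passes: int = 10) -> torch.Tensor:
    """Per-sample variance of predicted probabilities across stochastic passes."""
    model.eval()
    enable_dropout(model)
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(inputs), dim=-1) for _ in range(n_passes)]
        )  # shape: (n_passes, batch, num_classes)
    model.eval()  # restore fully deterministic inference
    # Variance across passes, averaged over classes: one uncertainty value per sample.
    return probs.var(dim=0).mean(dim=-1)
```

Samples whose PSU is an outlier relative to the rest of the training data would then be flagged as candidate backdoor samples; the exact thresholding rule is left unspecified here.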