Related papers: PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection

PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection

URL: http://arxiv.org/abs/2406.05826v1
Date: Sun, 9 Jun 2024 15:31:00 GMT
Title: PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection
Authors: Wei Li, Pin-Yu Chen, Sijia Liu, Ren Wang,
Abstract summary: Prediction Shift Backdoor Detection (PSBD) is a novel method for identifying backdoor samples in deep neural networks. PSBD is motivated by an intriguing Prediction Shift (PS) phenomenon, where poisoned models' predictions on clean data often shift away from true labels towards certain other labels. PSBD identifies backdoor training samples by computing the Prediction Shift Uncertainty (PSU), the variance in probability values when dropout layers are toggled on and off during model inference.
Score: 57.571451139201855
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep neural networks are susceptible to backdoor attacks, where adversaries manipulate model predictions by inserting malicious samples into the training data. Currently, there is still a lack of direct filtering methods for identifying suspicious training data to unveil potential backdoor samples. In this paper, we propose a novel method, Prediction Shift Backdoor Detection (PSBD), leveraging an uncertainty-based approach requiring minimal unlabeled clean validation data. PSBD is motivated by an intriguing Prediction Shift (PS) phenomenon, where poisoned models' predictions on clean data often shift away from true labels towards certain other labels with dropout applied during inference, while backdoor samples exhibit less PS. We hypothesize PS results from neuron bias effect, making neurons favor features of certain classes. PSBD identifies backdoor training samples by computing the Prediction Shift Uncertainty (PSU), the variance in probability values when dropout layers are toggled on and off during model inference. Extensive experiments have been conducted to verify the effectiveness and efficiency of PSBD, which achieves state-of-the-art results among mainstream detection methods.

Related papers

BURN: Backdoor Unlearning via Adversarial Boundary Analysis [73.14147934175604]
Backdoor unlearning aims to remove backdoor-related information while preserving the model's original functionality.<n>We propose Backdoor Unlearning via adversaRial bouNdary analysis (BURN), a novel defense framework that integrates false correlation decoupling, progressive data refinement, and model purification.
arXiv Detail & Related papers (2025-07-14T17:13:06Z)
Backdoor Defense through Self-Supervised and Generative Learning [0.0]
Training on such data injects a backdoor which causes malicious inference in selected test samples. This paper explores an approach based on generative modelling of per-class distributions in a self-supervised representation space. In both cases, we find that per-class generative models allow to detect poisoned data and cleanse the dataset.
arXiv Detail & Related papers (2024-09-02T11:40:01Z)
IBD-PSC: Input-level Backdoor Detection via Parameter-oriented Scaling Consistency [20.61046457594186]
Deep neural networks (DNNs) are vulnerable to backdoor attacks. This paper proposes a simple yet effective input-level backdoor detection (dubbed IBD-PSC) to filter out malicious testing images.
arXiv Detail & Related papers (2024-05-16T03:19:52Z)
Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal. Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths. Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks. We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively. Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z)
XGBD: Explanation-Guided Graph Backdoor Detection [21.918945251903523]
Backdoor attacks pose a significant security risk to graph learning models. We propose an explanation-guided backdoor detection method to take advantage of the topological information.
arXiv Detail & Related papers (2023-08-08T17:10:23Z)
ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions. We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z)
Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information. By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples. We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z)
Invisible Backdoor Attacks Using Data Poisoning in the Frequency Domain [8.64369418938889]
We propose a generalized backdoor attack method based on the frequency domain. It can implement backdoor implantation without mislabeling and accessing the training process. We evaluate our approach in the no-label and clean-label cases on three datasets.
arXiv Detail & Related papers (2022-07-09T07:05:53Z)
PiDAn: A Coherence Optimization Approach for Backdoor Attack Detection and Mitigation in Deep Neural Networks [22.900501880865658]
Backdoor attacks impose a new threat in Deep Neural Networks (DNNs) We propose PiDAn, an algorithm based on coherence optimization purifying the poisoned data. Our PiDAn algorithm can detect more than 90% infected classes and identify 95% poisoned samples.
arXiv Detail & Related papers (2022-03-17T12:37:21Z)
DAAIN: Detection of Anomalous and Adversarial Input using Normalizing Flows [52.31831255787147]
We introduce a novel technique, DAAIN, to detect out-of-distribution (OOD) inputs and adversarial attacks (AA) Our approach monitors the inner workings of a neural network and learns a density estimator of the activation distribution. Our model can be trained on a single GPU making it compute efficient and deployable without requiring specialized accelerators.
arXiv Detail & Related papers (2021-05-30T22:07:13Z)
Black-box Detection of Backdoor Attacks with Limited Information and Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model. In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z)
The Hidden Uncertainty in a Neural Networks Activations [105.4223982696279]
The distribution of a neural network's latent representations has been successfully used to detect out-of-distribution (OOD) data. This work investigates whether this distribution correlates with a model's epistemic uncertainty, thus indicating its ability to generalise to novel inputs.
arXiv Detail & Related papers (2020-12-05T17:30:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.