Backdoor Samples Detection Based on Perturbation Discrepancy Consistency in Pre-trained Language Models
- URL: http://arxiv.org/abs/2509.05318v1
- Date: Sat, 30 Aug 2025 06:35:32 GMT
- Title: Backdoor Samples Detection Based on Perturbation Discrepancy Consistency in Pre-trained Language Models
- Authors: Zuquan Peng, Jianming Fu, Lixin Zou, Li Zheng, Yanzhen Ren, Guojun Peng
- Abstract summary: We propose a backdoor sample detection method based on perturbatioN discrEpancy consisTency Evaluation (NETE). In the detection process, it only requires an off-the-shelf pre-trained model to compute the log probability of samples. Our method is based on the interesting phenomenon that the change in perturbation discrepancy for backdoor samples is smaller than that for clean samples.
- Score: 21.927099943522734
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The use of unvetted third-party and internet data renders pre-trained models susceptible to backdoor attacks. Detecting backdoor samples is critical to prevent backdoor activation during inference or injection during training. However, existing detection methods often require the defender to have access to the poisoned models, extra clean samples, or significant computational resources, limiting their practicality. To address this limitation, we propose a backdoor sample detection method based on perturbatioN discrEpancy consisTency Evaluation (NETE), a novel detection method that can be used in both the pre-training and post-training phases. The detection process only requires an off-the-shelf pre-trained model to compute the log probability of samples and an automated function based on a mask-filling strategy to generate perturbations. Our method builds on the observation that the change in perturbation discrepancy for backdoor samples is smaller than that for clean samples. Based on this phenomenon, we use curvature to measure the discrepancy in log probabilities between perturbed samples and input samples, thereby evaluating the consistency of the perturbation discrepancy to determine whether an input sample is a backdoor sample. Experiments on four typical backdoor attacks and five types of large language model backdoor attacks demonstrate that our detection strategy outperforms existing zero-shot black-box detection methods.
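The curvature-style score described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, illustrative implementation of the perturbation-discrepancy idea, not the authors' code: it assumes GPT-2 as the off-the-shelf scoring model and uses random word dropping as a crude stand-in for the paper's mask-filling perturbation; the model choice, perturbation ratio, number of perturbations, and threshold are all assumptions.

```python
# Minimal sketch of a perturbation-discrepancy score in the spirit of NETE
# (illustrative only, not the authors' implementation).
import random
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def log_prob(text: str) -> float:
    """Average token log-probability of `text` under the pre-trained model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model(ids, labels=ids)
    return -out.loss.item()  # loss is the mean negative log-likelihood

def perturb(text: str, ratio: float = 0.15) -> str:
    """Crude perturbation: drop a fraction of words.
    (The paper uses an automated mask-filling strategy instead.)"""
    words = text.split()
    kept = [w for w in words if random.random() > ratio]
    return " ".join(kept) if kept else text

def perturbation_discrepancy(text: str, n_perturb: int = 10) -> float:
    """Original log-prob minus the mean log-prob of perturbed copies."""
    base = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturb)]
    return base - sum(perturbed) / len(perturbed)

def is_suspicious(text: str, threshold: float = 0.1) -> bool:
    """Flag a sample whose discrepancy stays small (threshold is illustrative,
    to be tuned on held-out clean text)."""
    return perturbation_discrepancy(text) < threshold
```

Per the paper's observation, clean samples should show a larger drop in log probability after perturbation, while backdoor samples yield a smaller, more consistent discrepancy, which is what the detector keys on.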
Related papers
- BURN: Backdoor Unlearning via Adversarial Boundary Analysis [73.14147934175604]
Backdoor unlearning aims to remove backdoor-related information while preserving the model's original functionality. We propose Backdoor Unlearning via adversaRial bouNdary analysis (BURN), a novel defense framework that integrates false correlation decoupling, progressive data refinement, and model purification.
arXiv Detail & Related papers (2025-07-14T17:13:06Z) - Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models [12.519879298717104]
We propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in backdoor detection and model performance.
arXiv Detail & Related papers (2025-05-29T02:49:29Z) - Test-Time Backdoor Detection for Object Detection Models [14.69149115853361]
Object detection models are vulnerable to backdoor attacks. Transformation Consistency Evaluation (TRACE) is a brand-new method for detecting poisoned samples at test time in object detection. TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses.
arXiv Detail & Related papers (2025-03-19T15:12:26Z) - PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection [57.571451139201855]
Prediction Shift Backdoor Detection (PSBD) is a novel method for identifying backdoor samples in deep neural networks. PSBD is motivated by an intriguing Prediction Shift (PS) phenomenon, where poisoned models' predictions on clean data often shift away from true labels towards certain other labels. PSBD identifies backdoor training samples by computing the Prediction Shift Uncertainty (PSU), the variance in probability values when dropout layers are toggled on and off during model inference (see the sketch after this list).
arXiv Detail & Related papers (2024-06-09T15:31:00Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - XGBD: Explanation-Guided Graph Backdoor Detection [21.918945251903523]
Backdoor attacks pose a significant security risk to graph learning models.
We propose an explanation-guided backdoor detection method to take advantage of the topological information.
arXiv Detail & Related papers (2023-08-08T17:10:23Z) - Detecting Backdoors During the Inference Stage Based on Corruption Robustness Consistency [33.42013309686333]
We propose a test-time trigger sample detection method (TeCo) that only needs the hard-label outputs of the victim models without any extra information.
Our journey begins with the intriguing observation that the backdoor-infected models have similar performance across different image corruptions for the clean images, but perform discrepantly for the trigger samples.
Extensive experiments demonstrate that compared with state-of-the-art defenses, TeCo outperforms them on different backdoor attacks, datasets, and model architectures.
arXiv Detail & Related papers (2023-03-27T07:10:37Z) - Training set cleansing of backdoor poisoning by self-supervised representation learning [0.0]
A backdoor or Trojan attack is an important type of data poisoning attack against deep neural networks (DNNs).
We show that supervised training may build stronger association between the backdoor pattern and the associated target class than that between normal features and the true class of origin.
We propose to use unsupervised representation learning to avoid emphasising backdoor-poisoned training samples and learn a similar feature embedding for samples of the same class.
arXiv Detail & Related papers (2022-10-19T03:29:58Z) - Black-box Detection of Backdoor Attacks with Limited Information and Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z) - Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
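As referenced in the PSBD entry above, the dropout-variance idea behind the Prediction Shift Uncertainty can be sketched briefly. The snippet below is a minimal illustration, not the PSBD authors' code: it approximates the PSU by measuring the variance of class probabilities across dropout-enabled forward passes; the model interface, number of passes, and use of the maximum per-class variance as the per-sample score are assumptions.

```python
# Minimal dropout-variance sketch in the spirit of PSBD (illustrative only).
import torch

def prediction_shift_uncertainty(model: torch.nn.Module,
                                 x: torch.Tensor,
                                 n_passes: int = 20) -> torch.Tensor:
    """Variance of class probabilities across stochastic forward passes.
    Dropout is kept active at inference (model.train()) as a stand-in for
    toggling dropout on and off as described in the PSBD summary."""
    model.train()  # enable dropout layers during inference
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_passes)]
        )  # shape: (n_passes, batch, num_classes); assumes model returns logits
    model.eval()
    # Higher variance indicates a prediction shift; PSBD uses such an
    # uncertainty score to flag suspected backdoor training samples.
    return probs.var(dim=0).max(dim=-1).values  # per-sample score
```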