Activation Gradient based Poisoned Sample Detection Against Backdoor Attacks
- URL: http://arxiv.org/abs/2312.06230v2
- Date: Tue, 28 May 2024 03:36:40 GMT
- Title: Activation Gradient based Poisoned Sample Detection Against Backdoor Attacks
- Authors: Danni Yuan, Shaokui Wei, Mingda Zhang, Li Liu, Baoyuan Wu
- Abstract summary: We develop an innovative poisoned sample detection approach, called Activation Gradient based Poisoned sample Detection (AGPD).
First, we calculate the GCDs of all classes from the model trained on the untrustworthy dataset.
Then, we identify the target class(es) based on the difference in GCD dispersion between target and clean classes.
Last, we filter out poisoned samples within the identified target class(es) based on the clear separation between poisoned and clean samples.
- Score: 35.42528584450334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies the task of poisoned sample detection for defending against data poisoning based backdoor attacks. Its core challenge is finding a generalizable and discriminative metric to distinguish between clean and various types of poisoned samples (e.g., various triggers, various poisoning ratios). Inspired by a common phenomenon in backdoor attacks, namely that the backdoored model tends to map significantly different poisoned and clean samples within the target class to similar activation areas, we introduce a novel perspective on the circular distribution of the gradients w.r.t. sample activation, dubbed gradient circular distribution (GCD). Based on GCD, we make two interesting observations. One is that the GCD of samples in the target class is much more dispersed than that in a clean class. The other is that, in the GCD of the target class, poisoned and clean samples are clearly separated. Inspired by these two observations, we develop an innovative three-stage poisoned sample detection approach, called Activation Gradient based Poisoned sample Detection (AGPD). First, we calculate the GCDs of all classes from the model trained on the untrustworthy dataset. Then, we identify the target class(es) based on the difference in GCD dispersion between target and clean classes. Last, we filter out poisoned samples within the identified target class(es) based on the clear separation between poisoned and clean samples. Extensive experiments under various backdoor attack settings demonstrate the superior detection performance of the proposed method over existing sample activation-based poisoned sample detection approaches.
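To make the three-stage pipeline concrete, below is a minimal sketch in PyTorch of how such a detector could be organized. This is not the authors' implementation: the `model.features`/`model.classifier` interface, the angle-to-class-mean stand-in for GCD, the median-based dispersion test, and the largest-gap split are all illustrative assumptions, and the exact statistics used in AGPD may differ.

```python
# Hedged sketch of an AGPD-style detector (not the paper's reference code).
# Assumes a PyTorch classifier exposing `model.features(x)` (penultimate
# activations) and `model.classifier(feats)` (final linear head), and that the
# data loader yields (x, y) already on the model's device.
import torch
import torch.nn.functional as F


def activation_gradients(model, x, y):
    """Per-sample gradients of the loss w.r.t. the penultimate activations."""
    model.zero_grad(set_to_none=True)
    feats = model.features(x)                  # assumed feature extractor
    feats.retain_grad()
    logits = model.classifier(feats)           # assumed final head
    F.cross_entropy(logits, y, reduction="sum").backward()
    return feats.grad.flatten(1).detach()      # shape (batch, d)


def circular_angles(grads):
    """Angle of each gradient relative to the class-mean direction, in [0, pi]."""
    mean_dir = F.normalize(grads.mean(dim=0), dim=0)
    cosine = F.normalize(grads, dim=1) @ mean_dir
    return torch.acos(cosine.clamp(-1.0, 1.0))


def agpd_sketch(model, loader, num_classes, disp_factor=2.0):
    """Three stages: per-class angle distributions, dispersion-based target-class
    identification, and gap-based separation of poisoned from clean samples."""
    model.eval()
    grads = {c: [] for c in range(num_classes)}
    idx = {c: [] for c in range(num_classes)}
    offset = 0
    for x, y in loader:
        g = activation_gradients(model, x, y).cpu()
        y = y.cpu()
        for c in torch.unique(y).tolist():
            m = y == c
            grads[c].append(g[m])
            idx[c].append(torch.nonzero(m, as_tuple=False).flatten() + offset)
        offset += y.size(0)

    # Stage 1: a per-class "circular distribution" of activation gradients.
    angles, dispersion, indices = {}, {}, {}
    for c in range(num_classes):
        if not grads[c]:
            continue
        a = circular_angles(torch.cat(grads[c]))
        angles[c], indices[c] = a, torch.cat(idx[c])
        dispersion[c] = a.std().item()

    # Stage 2: classes whose angular dispersion is far above the median
    # dispersion are treated as candidate target classes.
    med = torch.tensor(list(dispersion.values())).median().item()
    targets = [c for c in dispersion if dispersion[c] > disp_factor * med]

    # Stage 3: inside each target class, split the sorted angles at the largest
    # gap and flag the smaller side as suspected poisoned samples.
    suspicious = []
    for c in targets:
        a, order = angles[c].sort()
        if a.numel() < 2:
            continue
        split = int((a[1:] - a[:-1]).argmax()) + 1
        lo, hi = order[:split], order[split:]
        flagged = hi if hi.numel() <= lo.numel() else lo
        suspicious.extend(indices[c][flagged].tolist())
    return targets, suspicious
```

The largest-gap split in stage 3 is a crude stand-in for the abstract's observation that poisoned and clean samples are clearly separated within the target class's GCD; any clustering or thresholding rule with the same intent could be substituted.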
Related papers
- Test-Time Backdoor Detection for Object Detection Models [14.69149115853361]
Object detection models are vulnerable to backdoor attacks.
Transformation Consistency Evaluation (TRACE) is a brand-new method for detecting poisoned samples at test time in object detection.
TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses.
arXiv Detail & Related papers (2025-03-19T15:12:26Z) - Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks are proven to be vulnerable to data poisoning attacks.
Detecting poisoned samples within a mixed dataset is both beneficial and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z) - Classifier Guidance Enhances Diffusion-based Adversarial Purification by Preserving Predictive Information [75.36597470578724]
Adversarial purification is one of the promising approaches to defend neural networks against adversarial attacks.
We propose the gUided Purification (COUP) algorithm, which purifies adversarial examples while keeping them away from the classifier's decision boundary.
Experimental results show that COUP can achieve better adversarial robustness under strong attack methods.
arXiv Detail & Related papers (2024-08-12T02:48:00Z) - CBPF: Filtering Poisoned Data Based on Composite Backdoor Attack [11.815603563125654]
This paper explores strategies for mitigating the risks associated with backdoor attacks by examining the filtration of poisoned samples.
A novel three-stage poisoning data filtering approach, known as Composite Backdoor Poison Filtering (CBPF), is proposed as an effective solution.
arXiv Detail & Related papers (2024-06-23T14:37:24Z) - Model X-ray:Detecting Backdoored Models via Decision Boundary [62.675297418960355]
Backdoor attacks pose a significant security vulnerability for deep neural networks (DNNs).
We propose Model X-ray, a novel backdoor detection approach based on the analysis of illustrated two-dimensional (2D) decision boundaries.
Our approach includes two strategies focused on the decision areas dominated by clean samples and the concentration of label distribution.
arXiv Detail & Related papers (2024-02-27T12:42:07Z) - DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via
Diffusion Models [12.42597979026873]
We propose DataElixir, a novel sanitization approach tailored to purify poisoned datasets.
We leverage diffusion models to eliminate trigger features and restore benign features, thereby turning the poisoned samples into benign ones.
Experiments conducted on 9 popular attacks demonstrate that DataElixir effectively mitigates various complex attacks while exerting minimal impact on benign accuracy.
arXiv Detail & Related papers (2023-12-18T09:40:38Z) - Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z) - Don't FREAK Out: A Frequency-Inspired Approach to Detecting Backdoor
Poisoned Samples in DNNs [130.965542948104]
In this paper, we investigate the frequency sensitivity of Deep Neural Networks (DNNs) when presented with clean samples versus poisoned samples.
We propose a frequency-based poisoned sample detection algorithm that is simple yet effective.
arXiv Detail & Related papers (2023-03-23T12:11:24Z) - DeepPoison: Feature Transfer Based Stealthy Poisoning Attack [2.1445455835823624]
DeepPoison is a novel adversarial network consisting of one generator and two discriminators.
DeepPoison can achieve a state-of-the-art attack success rate, as high as 91.74%.
arXiv Detail & Related papers (2021-01-06T15:45:36Z)