Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models
- URL: http://arxiv.org/abs/2505.23015v1
- Date: Thu, 29 May 2025 02:49:29 GMT
- Title: Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models
- Authors: Jinwen Chen, Hainan Zhang, Fei Sun, Qinnan Zhang, Sijia Wen, Ziwei Wang, Zhiming Zheng
- Abstract summary: We propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in backdoor detection and model performance.
- Score: 12.519879298717104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning LLMs on publisher-supplied datasets that contain stealthy backdoors poses security risks to downstream applications. Mainstream detection methods either identify poisoned samples by analyzing the prediction probabilities of a poisoned classification model or rely on a rewriting model to eliminate the stealthy triggers. However, the former cannot be applied to generation tasks, while the latter may degrade generation performance and introduce new triggers. Efficiently eliminating stealthy poisoned samples for LLMs therefore remains an urgent problem. We observe that, after applying TF-IDF clustering to the sample responses, there are notable differences in the intra-class distances between clean and poisoned samples: poisoned samples tend to cluster closely because of their specific malicious outputs, whereas clean samples are more scattered due to their more varied responses. In this paper, we therefore propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Specifically, we first compare each sample's response with the reference model's output and consider the sample suspicious if there is a significant discrepancy. We then perform TF-IDF clustering on these suspicious samples to identify the true poisoned samples based on their intra-class distance. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in both backdoor detection and model performance. Further analysis of different reference models also confirms the effectiveness of our Reference-Filtration.
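The two-stage pipeline described in the abstract maps naturally onto code. The following is a minimal sketch of the Reference-Filtration and Tfidf-Clustering idea under stated assumptions: a `reference_model` object with a `generate` method, a crude token-overlap similarity, and KMeans as the clustering step; the thresholds, similarity measure, and clustering algorithm are assumptions, not the authors' implementation.

```python
# Minimal sketch of the RFTC idea; names and parameters below are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def token_overlap(a: str, b: str) -> float:
    """Crude response similarity; the paper may use a different measure."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def reference_filtration(samples, reference_model, sim_threshold=0.3):
    """Stage 1: flag samples whose response diverges from a reference model's output.

    `reference_model.generate(prompt)` is a hypothetical API standing in for any
    trusted generation model.
    """
    suspicious = []
    for prompt, response in samples:
        ref_response = reference_model.generate(prompt)
        if token_overlap(response, ref_response) < sim_threshold:
            suspicious.append((prompt, response))
    return suspicious

def tfidf_clustering(suspicious, n_clusters=2):
    """Stage 2: cluster suspicious responses; the tightest cluster is treated as poisoned."""
    responses = [resp for _, resp in suspicious]
    X = TfidfVectorizer().fit_transform(responses).toarray()
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    # Intra-class distance: mean distance of cluster members to their centroid.
    intra = [
        np.linalg.norm(X[km.labels_ == c] - km.cluster_centers_[c], axis=1).mean()
        for c in range(n_clusters)
    ]
    poisoned_cluster = int(np.argmin(intra))  # poisoned responses cluster tightly
    return [s for s, lbl in zip(suspicious, km.labels_) if lbl == poisoned_cluster]
```

The number of clusters and the rule for selecting the "tight" cluster would need tuning in practice; the abstract only states that poisoned samples exhibit a noticeably smaller intra-class distance than clean ones.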
Related papers
- BURN: Backdoor Unlearning via Adversarial Boundary Analysis [73.14147934175604]
Backdoor unlearning aims to remove backdoor-related information while preserving the model's original functionality. We propose Backdoor Unlearning via adversaRial bouNdary analysis (BURN), a novel defense framework that integrates false correlation decoupling, progressive data refinement, and model purification.
arXiv Detail & Related papers (2025-07-14T17:13:06Z) - Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. It achieves near-zero attack success rates while maintaining clean-task performance.
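As a rough illustration of the attention-entropy idea (not BYE's actual procedure), one could score each training sample by the entropy of its attention distributions and filter statistical outliers; the head/position averaging and the z-score rule below are assumptions.

```python
# Hedged sketch: attention entropy as a self-supervised filtering signal.
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """attn: (heads, query_len, key_len) attention weights that sum to 1 over keys."""
    eps = 1e-12
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)  # entropy per head and query position
    return float(ent.mean())

def filter_by_entropy(entropies, z_thresh=2.0):
    """Keep sample indices whose entropy lies within z_thresh std. devs. of the mean."""
    e = np.asarray(entropies, dtype=np.float64)
    z = (e - e.mean()) / (e.std() + 1e-12)
    return [i for i, zi in enumerate(z) if abs(zi) <= z_thresh]
```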
arXiv Detail & Related papers (2025-05-22T17:11:58Z) - Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks have been shown to be vulnerable to data poisoning attacks.
Detecting poisoned samples in a mixed dataset is both beneficial and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z) - DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models [12.42597979026873]
We propose DataElixir, a novel sanitization approach tailored to purify poisoned datasets.
We leverage diffusion models to eliminate trigger features and restore benign features, thereby turning the poisoned samples into benign ones.
Experiments conducted on 9 popular attacks demonstrate that DataElixir effectively mitigates various complex attacks while exerting minimal impact on benign accuracy.
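A minimal sketch of the purification idea (noise the sample, then let a diffusion model reconstruct it so that trigger features are washed out); `denoise_fn` is a hypothetical stand-in for a pretrained diffusion model's reverse process, and the noise level and number of rounds are assumptions.

```python
# Hedged sketch of diffusion-based purification; denoise_fn is hypothetical.
import numpy as np

def purify(image: np.ndarray, denoise_fn, noise_std=0.3, rounds=1):
    """Perturb the image with Gaussian noise, then let the diffusion model
    reconstruct it; trigger patterns are expected to be destroyed while
    benign semantics are restored."""
    x = image.astype(np.float32)
    for _ in range(rounds):
        noisy = x + np.random.normal(scale=noise_std, size=x.shape)
        x = denoise_fn(noisy)  # reverse diffusion starting from the noised image
    return x
```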
arXiv Detail & Related papers (2023-12-18T09:40:38Z) - Activation Gradient based Poisoned Sample Detection Against Backdoor Attacks [35.42528584450334]
We develop an innovative poisoned sample detection approach, called Activation Gradient based Poisoned sample Detection (AGPD).
First, we calculate GCDs of all classes from the model trained on the untrustworthy dataset.
Then, we identify the target class(es) based on the difference in GCD dispersion between target and clean classes.
Last, we filter out poisoned samples within the identified target class(es) based on the clear separation between poisoned and clean samples.
arXiv Detail & Related papers (2023-12-11T09:17:33Z) - Projection Regret: Reducing Background Bias for Novelty Detection via Diffusion Models [72.07462371883501]
We propose Projection Regret (PR), an efficient novelty detection method that mitigates the bias of non-semantic information.
PR computes the perceptual distance between the test image and its diffusion-based projection to detect abnormality.
Extensive experiments demonstrate that PR outperforms the prior art of generative-model-based novelty detection methods by a significant margin.
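A rough sketch of the scoring idea described above (perceptual distance between a test image and its diffusion-based projection); `diffusion_project` is hypothetical, and LPIPS is used here only as one possible perceptual metric, not necessarily the paper's choice.

```python
# Hedged sketch: anomaly score as perceptual distance to a diffusion-based projection.
import torch
import lpips

perceptual = lpips.LPIPS(net='alex')  # pretrained perceptual-distance network

def projection_score(x: torch.Tensor, diffusion_project) -> float:
    """x: (1, 3, H, W) image scaled to [-1, 1]. Higher score = more anomalous."""
    proj = diffusion_project(x)  # partially noise x, then denoise it back
    with torch.no_grad():
        return perceptual(x, proj).item()
```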
arXiv Detail & Related papers (2023-12-05T09:44:47Z) - VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models [46.72546879204724]
In the real world, datasets may contain dirty samples, such as poisoned samples from backdoor attacks, noisy labels from crowdsourcing, and even hybrids of them.
Existing detectors focus only on detecting poisoned samples or noisy labels and are often prone to weak generalization when dealing with dirty samples from other domains.
We propose versatile data cleanser (VDC) leveraging the surpassing capabilities of multimodal large language models (MLLM) in cross-modal alignment and reasoning.
arXiv Detail & Related papers (2023-09-28T07:37:18Z) - Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring changes in the data manipulation to changes in the model outputs, Memorization Discrepancy can discover imperceptible poisoned samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
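As a rough illustration of a model-level discrepancy signal (the precise definition of Memorization Discrepancy is in the paper), one could compare a model's predictive distributions on the same inputs before and after an update and flag batches with an unusually large shift; the KL-based measure below is an assumption.

```python
# Hedged sketch: output-distribution shift between model snapshots as a poisoning signal.
import torch
import torch.nn.functional as F

def output_discrepancy(model_before, model_after, inputs: torch.Tensor) -> float:
    """Mean KL divergence between the two models' predictive distributions on `inputs`."""
    with torch.no_grad():
        p = F.log_softmax(model_before(inputs), dim=-1)
        q = F.log_softmax(model_after(inputs), dim=-1)
        # KL(q || p) averaged over the batch; a large value suggests the update
        # was driven by atypical (possibly poisoned) data.
        return F.kl_div(p, q, log_target=True, reduction='batchmean').item()
```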
arXiv Detail & Related papers (2023-06-06T14:45:24Z) - Detecting Backdoors During the Inference Stage Based on Corruption Robustness Consistency [33.42013309686333]
We propose a test-time trigger sample detection method (TeCo) that needs only the hard-label outputs of the victim models, without any extra information.
Our approach starts from the observation that backdoor-infected models have similar performance across different image corruptions for clean images, but perform discrepantly for trigger samples.
Extensive experiments demonstrate that compared with state-of-the-art defenses, TeCo outperforms them on different backdoor attacks, datasets, and model architectures.
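A rough sketch of the corruption-robustness-consistency idea (not necessarily TeCo's exact scoring rule): apply several corruptions at increasing severities, record the severity at which the hard-label prediction first changes, and measure how consistent that severity is across corruption types; clean inputs should behave consistently while trigger inputs should not. The corruption functions and the deviation statistic are assumptions.

```python
# Hedged sketch of hard-label corruption-consistency scoring.
import numpy as np

def first_flip_severity(predict, image, corrupt, severities=(1, 2, 3, 4, 5)):
    """Severity at which the hard-label prediction first deviates from the clean label."""
    clean_label = predict(image)
    for s in severities:
        if predict(corrupt(image, s)) != clean_label:
            return s
    return max(severities) + 1  # prediction never flipped

def consistency_score(predict, image, corruptions):
    """Std. dev. of first-flip severities across corruption types; large = suspicious."""
    flips = [first_flip_severity(predict, image, c) for c in corruptions]
    return float(np.std(flips))
```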
arXiv Detail & Related papers (2023-03-27T07:10:37Z) - Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling [77.15766509677348]
Conditional generative models often inherit spurious correlations from the training dataset.
This can result in label-conditional distributions that are imbalanced with respect to another latent attribute.
We propose a general two-step strategy to mitigate this issue.
arXiv Detail & Related papers (2022-12-05T08:09:33Z) - Unsupervised Anomaly Detection with Adversarial Mirrored AutoEncoders [51.691585766702744]
We propose a variant of the Adversarial Autoencoder that uses a mirrored Wasserstein loss in the discriminator to enforce better semantic-level reconstruction.
We put forward an alternative measure of anomaly score to replace the reconstruction-based metric.
Our method outperforms the current state-of-the-art methods for anomaly detection on several OOD detection benchmarks.
arXiv Detail & Related papers (2020-03-24T08:26:58Z)