Detecting Semantic Backdoors in a Mystery Shopping Scenario
- URL: http://arxiv.org/abs/2601.03805v1
- Date: Wed, 07 Jan 2026 11:04:04 GMT
- Title: Detecting Semantic Backdoors in a Mystery Shopping Scenario
- Authors: Arpad Berta, Gabor Danner, Istvan Hegedus, Mark Jelasity,
- Abstract summary: We tackle the problem of detecting semantic backdoors in classification models. Under the assumption that the clean training dataset and the training recipe of the model are both known, we propose a reference model pool. We experimentally analyze a number of approaches to compute model distances, and we also test a scenario where the provider performs an adaptive attack to avoid detection.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting semantic backdoors in classification models--where some classes can be activated by certain natural, but out-of-distribution inputs--is an important problem that has received relatively little attention. Semantic backdoors are significantly harder to detect than backdoors that are based on trigger patterns due to the lack of such clearly identifiable patterns. We tackle this problem under the assumption that the clean training dataset and the training recipe of the model are both known. These assumptions are motivated by a consumer protection scenario, in which the responsible authority performs mystery shopping to test a machine learning service provider. In this scenario, the authority uses the provider's resources and tools to train a model on a given dataset and tests whether the provider included a backdoor. In our proposed approach, the authority creates a reference model pool by training a small number of clean and poisoned models using trusted infrastructure, and calibrates a model distance threshold to identify clean models. We propose and experimentally analyze a number of approaches to compute model distances and we also test a scenario where the provider performs an adaptive attack to avoid detection. The most reliable method is based on requesting adversarial training from the provider. The model distance is best measured using a set of input samples generated by inverting the models in such a way as to maximize the distance from clean samples. With these settings, our method can often completely separate clean and poisoned models, and it proves to be superior to state-of-the-art backdoor detectors as well.
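As a concrete illustration of the pipeline described in the abstract, the sketch below shows how an auditing authority might calibrate a distance threshold from a small pool of clean and deliberately poisoned reference models and then flag the provider-trained model. This is a minimal sketch under stated assumptions, not the authors' implementation: `predict_proba` stands in for whatever interface the audited models expose, the probe set stands in for the paper's model-inversion samples, and all function names are illustrative.

```python
import numpy as np

# Minimal sketch of the mystery-shopping audit described in the abstract.
# Everything here is an assumption for illustration, not the authors' code.

def output_distance(model_a, model_b, probe_inputs):
    """Mean absolute difference between the two models' class probabilities
    on a shared probe set (the paper reports that probes generated by model
    inversion, pushed away from clean samples, work best)."""
    pa = model_a.predict_proba(probe_inputs)
    pb = model_b.predict_proba(probe_inputs)
    return float(np.mean(np.abs(pa - pb)))

def calibrate_threshold(clean_pool, poisoned_pool, probe_inputs):
    """Calibrate a distance threshold from a small pool of reference models
    trained on trusted infrastructure (some clean, some deliberately poisoned)."""
    clean_scores = [
        np.mean([output_distance(m, ref, probe_inputs)
                 for ref in clean_pool if ref is not m])
        for m in clean_pool
    ]
    poisoned_scores = [
        np.mean([output_distance(m, ref, probe_inputs) for ref in clean_pool])
        for m in poisoned_pool
    ]
    # Midpoint between the farthest clean model and the closest poisoned one;
    # any rule that separates the two calibration sets could be used instead.
    return (max(clean_scores) + min(poisoned_scores)) / 2.0

def audit_provider_model(provider_model, clean_pool, probe_inputs, threshold):
    """Flag the provider-trained model if it is farther from the clean
    reference pool than the calibrated threshold."""
    score = np.mean([output_distance(provider_model, ref, probe_inputs)
                     for ref in clean_pool])
    return score > threshold, float(score)
```

In the paper, the best-separating distance is computed on inputs obtained by inverting the models so as to maximize their distance from clean samples, after requesting adversarial training from the provider; the generic probe set above is only a placeholder for that step.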
Related papers
- Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting. Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines. Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
- Solving Trojan Detection Competitions with Linear Weight Classification [1.24275433420322]
We introduce a detector that works remarkably well across many of the existing datasets and domains.
We evaluate this algorithm on a diverse set of Trojan detection benchmarks and domains.
arXiv Detail & Related papers (2024-11-05T19:00:34Z)
- Towards Robust Object Detection: Identifying and Removing Backdoors via Module Inconsistency Analysis [5.8634235309501435]
We propose a backdoor defense framework tailored to object detection models.
By quantifying and analyzing inconsistencies, we develop an algorithm to detect backdoors.
Experiments with state-of-the-art two-stage object detectors show our method achieves a 90% improvement in backdoor removal rate.
arXiv Detail & Related papers (2024-09-24T12:58:35Z)
- Backdoor Defense through Self-Supervised and Generative Learning [0.0]
Training on poisoned data injects a backdoor that causes malicious inference on selected test samples.
This paper explores an approach based on generative modelling of per-class distributions in a self-supervised representation space.
In both cases, we find that per-class generative models make it possible to detect poisoned data and cleanse the dataset (a minimal sketch of this idea appears after the related-papers list).
arXiv Detail & Related papers (2024-09-02T11:40:01Z)
- Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks [63.269788236474234]
We propose to use model pairs on open-set classification tasks for detecting backdoors.
We show that this score can indicate the presence of a backdoor even when the models have different architectures.
This technique allows backdoors to be detected in models designed for open-set classification tasks, a setting that has received little attention in the literature.
arXiv Detail & Related papers (2024-02-28T21:29:16Z)
- Are You Stealing My Model? Sample Correlation for Fingerprinting Deep Neural Networks [86.55317144826179]
Previous methods always leverage transferable adversarial examples as the model fingerprint.
We propose a novel yet simple model-stealing detection method based on SAmple Correlation (SAC).
SAC successfully defends against various model stealing attacks, including those involving adversarial training or transfer learning.
arXiv Detail & Related papers (2022-10-21T02:07:50Z)
- CrowdGuard: Federated Backdoor Detection in Federated Learning [39.58317527488534]
This paper presents a novel defense mechanism, CrowdGuard, that effectively mitigates backdoor attacks in Federated Learning.
CrowdGuard employs a server-located stacked clustering scheme to enhance its resilience to rogue client feedback.
The evaluation results demonstrate that CrowdGuard achieves 100% true-positive and true-negative rates across various scenarios.
arXiv Detail & Related papers (2022-10-14T11:27:49Z)
- MOVE: Effective and Harmless Ownership Verification via Embedded External Features [104.97541464349581]
We propose an effective and harmless model ownership verification method (MOVE) to defend against different types of model stealing simultaneously. We conduct ownership verification by checking whether a suspicious model contains the knowledge of defender-specified external features. We then train a meta-classifier to determine whether a model is stolen from the victim.
arXiv Detail & Related papers (2022-08-04T02:22:29Z)
- Defending against Model Stealing via Verifying Embedded External Features [90.29429679125508]
Adversaries can 'steal' deployed models even when they have no training samples and cannot access the model parameters or structures.
We explore the defense from another angle by verifying whether a suspicious model contains the knowledge of defender-specified external features.
Our method is effective in detecting different types of model stealing simultaneously, even if the stolen model is obtained via a multi-stage stealing process.
arXiv Detail & Related papers (2021-12-07T03:51:54Z)
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
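For the per-class generative defense summarized above (Backdoor Defense through Self-Supervised and Generative Learning), the following is a minimal sketch of the idea under stated assumptions, not that paper's code: a single diagonal Gaussian per class stands in for the per-class generative model, and the self-supervised encoder that produces `features` is assumed to exist.

```python
import numpy as np

def fit_class_gaussians(features, labels):
    """Fit one diagonal Gaussian per class to the (self-supervised) features."""
    models = {}
    for c in np.unique(labels):
        class_feats = features[labels == c]
        models[c] = (class_feats.mean(axis=0), class_feats.var(axis=0) + 1e-6)
    return models

def poison_scores(features, labels, models):
    """Negative log-likelihood of each sample under its own class model.
    Poisoned samples tend to score high because they lie far from the clean
    per-class distribution, so the highest-scoring fraction of the training
    set can be dropped before retraining ("cleansing" the dataset)."""
    scores = np.empty(len(features))
    for i, (x, y) in enumerate(zip(features, labels)):
        mean, var = models[y]
        scores[i] = 0.5 * np.sum((x - mean) ** 2 / var + np.log(2.0 * np.pi * var))
    return scores
```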