Related papers: Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks

Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks

URL: http://arxiv.org/abs/2602.10780v1
Date: Wed, 11 Feb 2026 12:13:25 GMT
Title: Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks
Authors: Enrico Ahlers, Daniel Passon, Yannic Noller, Lars Grunske,
Abstract summary: A well-known vulnerability is a backdoor introduced into a neural network by poisoned training data or a malicious training process.<n>We propose our inference-time backdoor mitigation approach called FIRE.<n>We view the trigger as directions in the latent spaces between layers that can be applied in reverse to correct the inference mechanism.
Score: 1.9517610560768623
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Machine learning models are increasingly present in our everyday lives; as a result, they become targets of adversarial attackers seeking to manipulate the systems we interact with. A well-known vulnerability is a backdoor introduced into a neural network by poisoned training data or a malicious training process. Backdoors can be used to induce unwanted behavior by including a certain trigger in the input. Existing mitigations filter training data, modify the model, or perform expensive input modifications on samples. If a vulnerable model has already been deployed, however, those strategies are either ineffective or inefficient. To address this gap, we propose our inference-time backdoor mitigation approach called FIRE (Feature-space Inference-time REpair). We hypothesize that a trigger induces structured and repeatable changes in the model's internal representation. We view the trigger as directions in the latent spaces between layers that can be applied in reverse to correct the inference mechanism. Therefore, we turn the backdoored model against itself by manipulating its latent representations and moving a poisoned sample's features along the backdoor directions to neutralize the trigger. Our evaluation shows that FIRE has low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.

Related papers

Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models [74.1970982768771]
We show that well-established data-poisoning pipelines can successfully implant backdoors into MDLMs.<n>We introduce a backdoor defense framework for MDLMs named DiSP (Diffusion Self-Purification)
arXiv Detail & Related papers (2026-02-24T15:47:52Z)
Backdoor Unlearning by Linear Task Decomposition [69.91984435094157]
Foundation models are highly susceptible to adversarial perturbations and targeted backdoor attacks.<n>Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior.<n>This raises the question of whether backdoors can be removed without compromising the general capabilities of the models.
arXiv Detail & Related papers (2025-10-16T16:18:07Z)
Backdoor Mitigation by Correcting the Distribution of Neural Activations [30.554700057079867]
Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs) We analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances. We propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration.
arXiv Detail & Related papers (2023-08-18T22:52:29Z)
Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks. backdoor attack is an emerging yet threatening training-phase threat. We propose a sparse and invisible backdoor attack (SIBA)
arXiv Detail & Related papers (2023-05-11T10:05:57Z)
Backdoor Defense via Deconfounded Representation Learning [17.28760299048368]
We propose a Causality-inspired Backdoor Defense (CBD) to learn deconfounded representations for reliable classification. CBD is effective in reducing backdoor threats while maintaining high accuracy in predicting benign samples.
arXiv Detail & Related papers (2023-03-13T02:25:59Z)
Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics. We show that, once the backdoor is embedded into the target model by our attack, it can trick the model to lose detection of any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure. We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z)
Backdoor Learning Curves: Explaining Backdoor Poisoning Beyond Influence Functions [23.750285504961337]
We study the process of backdoor learning under the lens of incremental learning and influence functions.<n>We show that the effectiveness of backdoor attacks depends on: (i) the complexity of the learning algorithm; (ii) the fraction of backdoor samples injected into the training set; and (iii) the size and visibility of the backdoor trigger.
arXiv Detail & Related papers (2021-06-14T08:00:48Z)
Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch. We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types. In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.