Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
- URL: http://arxiv.org/abs/2406.02619v2
- Date: Sat, 01 Feb 2025 08:22:37 GMT
- Title: Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
- Authors: Andis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt,
- Abstract summary: We introduce a novel class of backdoors in transformer models, that are unelicitable in nature.
Unelicitability prevents the defender from triggering the backdoor, making it impossible to properly evaluate ahead of deployment.
We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties.
- Score: 1.1118610055902116
- License:
- Abstract: The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in transformer models, that, in contrast to prior art, are unelicitable in nature. Unelicitability prevents the defender from triggering the backdoor, making it impossible to properly evaluate ahead of deployment even if given full white-box access and using automated techniques, such as red-teaming or certain formal verification methods. We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties. We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. Additionally, we expand on previous work by showing that our universal backdoors, while not completely undetectable in white-box settings, can be harder to detect than some existing designs. By demonstrating the feasibility of seamlessly integrating backdoors into transformer models, this paper fundamentally questions the efficacy of pre-deployment detection strategies. This offers new insights into the offence-defence balance in AI safety and security.
Related papers
- A Backdoor Attack Scheme with Invisible Triggers Based on Model Architecture Modification [12.393139669821869]
Traditional backdoor attacks involve injecting malicious samples with specific triggers into the training data.
More sophisticated attacks modify the model's architecture directly.
A novel backdoor attack method is presented in the paper.
It embeds the backdoor within the model's architecture and has the capability to generate inconspicuous and stealthy triggers.
arXiv Detail & Related papers (2024-12-22T07:39:43Z) - Injecting Undetectable Backdoors in Obfuscated Neural Networks and Language Models [39.34881774508323]
We investigate the threat posed by undetectable backdoors in ML models developed by external expert firms.
We develop a strategy to plant backdoors to obfuscated neural networks, that satisfy the security properties of the celebrated notion of indistinguishability obfuscation.
Our method to plant backdoors ensures that even if the weights and architecture of the obfuscated model are accessible, the existence of the backdoor is still undetectable.
arXiv Detail & Related papers (2024-06-09T06:26:21Z) - Poisoned Forgery Face: Towards Backdoor Attacks on Face Forgery
Detection [62.595450266262645]
This paper introduces a novel and previously unrecognized threat in face forgery detection scenarios caused by backdoor attack.
By embedding backdoors into models, attackers can deceive detectors into producing erroneous predictions for forged faces.
We propose emphPoisoned Forgery Face framework, which enables clean-label backdoor attacks on face forgery detectors.
arXiv Detail & Related papers (2024-02-18T06:31:05Z) - Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model to lose detection of any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z) - An anomaly detection approach for backdoored neural networks: face
recognition as a case study [77.92020418343022]
We propose a novel backdoored network detection method based on the principle of anomaly detection.
We test our method on a novel dataset of backdoored networks and report detectability results with perfect scores.
arXiv Detail & Related papers (2022-08-22T12:14:13Z) - Architectural Backdoors in Neural Networks [27.315196801989032]
We introduce a new class of backdoor attacks that hide inside model architectures.
These backdoors are simple to implement, for instance by publishing open-source code for a backdoored model architecture.
We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive a complete re-training from scratch.
arXiv Detail & Related papers (2022-06-15T22:44:03Z) - Check Your Other Door! Establishing Backdoor Attacks in the Frequency
Domain [80.24811082454367]
We show the advantages of utilizing the frequency domain for establishing undetectable and powerful backdoor attacks.
We also show two possible defences that succeed against frequency-based backdoor attacks and possible ways for the attacker to bypass them.
arXiv Detail & Related papers (2021-09-12T12:44:52Z) - Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word
Substitution [57.51117978504175]
Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks.
Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated.
We present invisible backdoors that are activated by a learnable combination of word substitution.
arXiv Detail & Related papers (2021-06-11T13:03:17Z) - Black-box Detection of Backdoor Attacks with Limited Information and
Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z) - WaNet -- Imperceptible Warping-based Backdoor Attack [20.289889150949836]
A third-party model can be poisoned in training to work well in normal conditions but behave maliciously when a trigger pattern appears.
In this paper, we propose using warping-based triggers to attack third-party models.
The proposed backdoor outperforms the previous methods in a human inspection test by a wide margin, proving its stealthiness.
arXiv Detail & Related papers (2021-02-20T15:25:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.